The Safety Theatre of Agentic AI
Or: how we learned to stop worrying and deploy the benchmark
In January, researchers at Carnegie Mellon and Fujitsu presented FieldWorkArena at the AAAI conference in Singapore. A benchmark designed to measure whether AI agents are safe enough to field in live industrial settings. Factories. Warehouses. Places where the wrong answer doesn’t embarrass the product manager but puts someone in a sling.
FieldWorkArena uses real-world data: work manuals, safety regulations, video from factory floors. It checks whether an agent can spot a PPE violation, generate an incident report, cross-reference compliance standards. Rigorous methodology. The researchers understand the stakes.
The systems it was designed to evaluate were already running in October 2025.
The infrastructure arrives first, with a business case. The safety validation arrives afterward, in a paper, with caveats about generalisability. Good science arriving after the deployment: the inspection report filed while the building is already occupied and the exits are unmarked.
Anyone who has spent time in safety-critical engineering will recognise the pattern: the thing ships, then the people whose job is to worry about what ships discover what has already shipped.
Alia found the error six weeks ago in a logistics depot in the East Midlands.
Not by looking for it. By the consignment that got flagged: a pharmaceutical shipment, cross-border, held at customs. A quiet Friday afternoon working backwards through the documentation trail.
It took her four hours. Her first thought was not what went wrong. It was: how long had it been going wrong?
Alia is a compliance lead at a mid-size logistics firm. Eleven years in the role. Since March, the company has run an agentic AI system across operational coordination: route optimisation, exception handling, customs documentation. It handles the 80% of cases that are variants of known patterns, running faster and with fewer errors than her team managed at scale. A colleague with dyslexia has quietly flourished since the documentation work migrated to the agent; years of masking the cognitive load of dense customs forms, gone. She would not reverse the rollout.
But in August, a narrow UK customs regulation took effect. Specific pharmaceutical classifications. Specific border configurations. The agent possessed no training data from the post-regulation environment. It encountered the new requirements and did what agents do: found the closest pattern and continued. It generated compliant-looking documentation containing a systematic error. Not random. Not detectable by spot check because the error was internally consistent. The agent was not making mistakes. It was following rules that no longer corresponded to the world.
Six weeks of immaculate documentation. Then a flagged consignment: customs hold, potential licence exposure, the kind of outcome that gets a chief officer attached to the question.
The question was not whether the vendor would fix it. The vendor had a roadmap. The question was what else the system was doing at the edges of training envelopes nobody had mapped. Alia’s question now has a budget line and a chief officer attached to it.
In a manufacturing group outside Rotterdam, Bel Riose has been doing arithmetic he does not like.
He is the Chief Transformation Officer. He owns the AI investment alongside the slide deck that promised the board 23% efficiency gains. The 23% was not fabricated. It emerged from a serious pilot at one facility: controlled scope, a dedicated engineering team on-site for six months. It worked. The board asked when this could scale to all fourteen facilities.
Six are running now. Two have experienced anomalous behaviour severe enough to require human intervention. Wasteful in ways that don’t appear cleanly in the efficiency metrics because the waste distributes across small decisions made at speed: a misrouted pallet here, a substituted component approved without the right sign-off there. The aggregate numbers still look acceptable. The 23% remains plausible if you average across the functioning facilities and decline to examine the others closely. Which is what quarterly presentations are for.
What Riose understands now, in a way he only suspected then: the pilot worked because of the engineering team. The agent was never the product. The agent plus the engineers doing continuous supervision was the product. When he scaled the agent without scaling the engineering capacity, he shipped the chassis of a car and left the steering column on the factory floor.
This realisation does not belong to him alone. It belongs to every rollout currently running on the back of a pilot that included components the rollout does not.
In Hamburg, Yevgenia Orlova has been in this product meeting for forty minutes. The room is sunny. There are pastries. The agenda has six items. Item four is whether the tool they are about to ship might, in specific circumstances, produce consequences that would be difficult to explain to a magistrate.
She runs the AI safety team at the industrial automation company building the products Alia’s firm and Riose’s group deploy. She arrived from robotics safety, a field where “breakdown class” still carries the faint odour of litigation. Her team has documented the problem with the care of people who understand that documentation is what you produce before someone asks to see it in discovery. They are proposing a constraint on deployment scope or a fix that slows the release schedule by three weeks.
The product lead explains the competitive landscape. He is not wrong about the facts. The competitor ships without these constraints. The customer expects the release on time. These facts produce a conclusion the room translates into corporate physics: they are proactively managing downstream risk profiles. They are shipping a product with a known breakdown class and will address the consequences in a later quarter. With a different set of charts.
Nobody is lying. The pastries are excellent.
The meeting ends on time. Somewhere outside Stuttgart, the thing they just approved is already making decisions at a speed that makes intervention impossible. Nobody is watching. The monitoring dashboard is a line item on the Q4 engineering backlog. Nobody in the room held the authority to make monitoring a shipping condition, and nobody asked who did.
The Stanford AI Index, published in April 2026, documents the acceleration. Scores on Humanity’s Last Exam have risen from 8.8% to approaching 50% in fifteen months. The exam was designed to test the outer edge of then-current capability, a ceiling that no longer holds. The Index, in the same breath, notes carefully that “we generally lack measures of how well a system needs to function in a particular setting.”
We have watched the score climb, and we still cannot tell you what the score means when the setting is a pharmaceutical shipment at a border crossing in August.
Better at the test. Unknown on the job. The distance between them is where Alia lives now.
Alia has raised the customs error with the vendor. The vendor has been responsive. There is a roadmap.
She has also started keeping a parallel log. Not because anyone asked. Because eleven years of understanding how systems fail teaches you to watch for the dysfunctions that don’t produce flagged consignments. The flagged consignment was the good outcome. Visible, traceable, fixable. What the log is for is the other category: malfunctions that are consistent, plausible, and invisible until a downstream consequence arrives that no longer traces back cleanly. She keeps it anyway. A few hundred kilometres to the south-east, someone else is about to keep a different kind of promise to himself.
Bel Riose is going to give an honest Q3 presentation. He has decided this. It will be the most uncomfortable thing he has done in his professional career, which, across three restructurings and one acquisition that destroyed substantially more value than it created, is saying something.
The honest presentation will say the pilot worked because of the engineering team. The agent was not the product; the agent plus continuous supervision was the product. Scaling one without the other is a different proposition, and the shape of that difference is now visible across six facilities operating at production velocity. It will also say the 23% is achievable, but what it requires is not less AI but different human capacity alongside it: people who understand the architecture well enough to supervise it, a skill distinct from the one held by the people who previously did the thing the agent now does. You cannot train them faster than the rollout timeline demands. The gap is months, sometimes longer.
He does not know if the board will hear this, or merely listen to it.
What has changed for Yevgenia in the last six months is not the meeting. The meeting is the same. What has changed is the vocabulary available to her.
The FieldWorkArena benchmark has given her team something internal documentation never could: external, citable language that enters a product conversation without sounding like risk-aversion dressed as engineering principle. When she can say “this behaviour would fail the FieldWorkArena safety protocol on incident detection” rather than “I’m worried about this,” the conversation changes shape. A benchmark, it turns out, is also a permission slip.
Not always. But sometimes. More often than before.
She is also, quietly, in conversation with counterparts at two competitor companies. They are all observing the same breakdown signatures. They are all having versions of the same internal argument. Nobody has proposed anything formal, but something is crystallising in the space between these calls, the way a standard forms before anyone calls it one.
Yevgenia knows how safety standards actually form. They form after incidents, or, rarely, just before the incident that would have been catastrophic enough to change everything. The art is making the near-miss legible enough to act on before the incident itself arrives.
She is trying to make the near-misses legible. In meetings that end with the release schedule unchanged. In calls with competitors facing the same structural pressures. In documentation that might become the language for a standard that does not yet exist. She does not know if it will be enough.
The technology is running: partially, imperfectly, better than some alternatives, worse than the marketing suggested. The genuine value is real. Neither Alia nor Riose would reverse the rollout. That matters.
What none of the timelines budgeted for was the gap between the system functioning and the system functioning safely at scale. These are distinct conditions. The industry is largely doing verification (does it work as designed?) and calling it validation. What is needed is harder: does it work as needed, in this context, at this speed, against rules that keep changing? The distance between those questions is not mainly a technical problem. It is a structural one: the benchmark arrives after the rollout; safety concerns enter product conversations after the release schedule is set; the post-mortem arrives after the incident. The pattern is not a flaw. It is the operating logic.
The benchmark is not theatre. Yevgenia’s work is not theatre. Alia’s parallel log is not theatre.
The theatre is the institutional claim that because these things exist, the deployments are governed. That because the benchmark was presented, the rollout is validated. That because a safety team exists, its concerns are reflected in the product. That because we are constructing the frameworks, the frameworks are operational. The gap between what is claimed and what is true is not new. It has always been present in the deployment of complex systems. What AI adds is speed and illegibility: decisions made faster than oversight can follow, failures distributed across patterns that don’t resolve into a single flagged consignment until they do. What is changing is the proximity of the consequences, which is making that gap visible to people outside the room.
The question worth sitting with is what it would take for the sequence to reverse, for validation to precede rollout at scale rather than trailing it.
History suggests a reliable answer. In 1956, two aircraft collided over the Grand Canyon, killing 128 people. Congress created the FAA two years later. NASA’s safety infrastructure was rebuilt in 1986, months after Challenger. Pharmaceutical manufacturing standards tightened after thalidomide. The dead were the argument. The living were the audience. The legislation followed.
We could choose to act before the dead make the argument for us. We rarely do.
The choice keeps being available, right up until it isn’t.
The FieldWorkArena benchmark and related Carnegie Mellon/Fujitsu safety research were presented at AAAI 2026 in January. The Stanford AI Index 2026, published in April 2026, notes that AI benchmark performance continues to improve while measures of real-world safety and utility remain underdeveloped. These are the same observation.
Future Tense publishes every week. Paid subscribers get the analysis that goes deeper and the fiction that goes further.