A company plugs AI into its document flow. The first week, an employee reads every answer carefully, makes edits, occasionally sends something back to be redone. A month in, they approve almost everything without looking, because the AI is usually right. On paper, the controls are there: a human is in the loop, the work is flowing, no errors in sight. In reality, oversight is dead and nobody’s noticed yet. They’ll notice when the first mistake slips out into the world and it turns out that, for weeks, no one could have caught it anyway.
This isn’t a scare story — it’s one of the quietest ways to gut the whole point of a human+AI pairing. In what follows we’ll unpack the kinds of pairings that actually exist, which tasks are worth handing to AI, where exactly human review breaks down, and why, without a digital trail, that oversight can’t be proven or fixed. We’ll come back to that department with the dead controls.
What this is about, and what it isn’t
Let me separate out three situations that constantly get lumped together.
Personal use. An employee helps themselves: asks the AI to rough out a draft, transcribe a call, find a formula. Useful, and widespread. But for the company this isn’t a managed process yet — it’s one person’s private initiative.
Shadow use. The same thing, but at scale and invisible to the company. People are already taking work tasks to ChatGPT, to local LLMs, to call transcribers and browser extensions, and the company can’t see any of it. And this is the key part: while AI works in the shadows, the company sees neither the data, nor the quality, nor the errors. This is really the first form of having no trail — work is already running through AI, but it’s invisible. For a lot of companies, this is the actual starting point: the question isn’t “should we adopt AI,” it’s “what’s already happening blind, and where does it create risk.” Banning it outright is usually pointless. The smarter move is to figure out where it already helps and where it’s dangerous, then gradually bring the useful parts out of the shadows and into proper rules.
The organized pairing. The company has deliberately built AI into a workflow. The AI does part of the work, a human owns the result that goes out — to a client, into a document, into a system. This is where all the money and all the landmines are. This article is mostly about this one.
The difference is fundamental. In personal use, the mistake usually stays internal: the person decides for themselves whether to trust the draft or rewrite it, and they bear the consequences. In an organized pairing, an AI mistake can go out into the world, and the question “who let this through” too often has no answer, because the pairing was assembled but nobody ever agreed who is responsible for what within it.
Why the savings leak into review
In the moment, AI often feels like a big win: a draft appears in a minute, a document gets parsed, an email gets assembled. But in a real workflow, savings are measured not at the moment of generation but by a finished result you’re not embarrassed to ship. And between the draft and that result sits review. That’s exactly where the win leaks out.
The measurements bear this out — though treat the numbers as a rough guide, since every study has its own sample.
First, people are bad at judging their own gains. METR, in 2026, specifically warns that self-reported estimates can substantially overstate the effect. When savings are tallied as a share of working time rather than as a feeling of speed, the numbers are more modest: in a St. Louis Fed study, generative-AI users reported saving on the order of 5% of their working time. There’s a wide gap between “I feel like I’m twice as fast” and the actual savings.
Second, review eats a big chunk of the gain directly. In a Foxit/Sapio study on document work (US and UK), the time spent reviewing all but wipes out the claimed savings: managers net around 16 minutes a week, and employees actually come out negative. The time saved on “doing” leaks into “checking after the AI.”
The takeaway isn’t “AI doesn’t work.” It’s this: a human+AI pairing doesn’t deliver results on its own — only if it’s set up right. Otherwise you’re paying twice — for the AI, and for the human now cleaning up after the AI.
Which pairing are you building
Before you count the savings, you need to know which pairing you’re actually building. “A human working with AI” isn’t one scenario — it’s at least three. People confuse them, and the workload and the risks are completely different in each.
Mode 1. AI assistant: the human leads. The human does the work; the AI suggests — a draft, a phrasing, an idea, a sanity check on an option. The decision and the outcome stay with the human. The workload is roughly the same as before, just faster on individual steps. When to use it: at the start, and on non-standard or creative tasks where judgment matters. Low risk.
Mode 2. AI does it, the human reviews. The AI makes the first pass over the flow; the human reviews, edits, approves. The most common working mode in business — and the most treacherous: this is where most of the traps live. The workload shifts from production to review — that’s a different kind of work and a different kind of fatigue. When to use it: on a repeatable flow with a moderate cost of error. Risk is medium and up, and it all hinges on whether the human is actually reviewing.
Mode 3. AI on autopilot, human on exceptions. The AI handles the flow itself; the human only deals with what the system flags as doubtful, plus spot-checks the rest. The human works not with the whole flow, but with its difficult tail. When to use it: on large volumes of uniform tasks with a clear cost of error — and only after Mode 2 has shown stable quality. Jumping straight here is almost guaranteed to mean missed errors. High risk if the triggers for calling in a human are poorly calibrated.
From our own work. Torgi, our tender-scouting agent (still a pilot), runs as a Mode 2–3 setup: the AI makes the entire first pass — finds the tenders, parses the documents, builds a dossier on the buyer, and for each one issues a verdict — “go for it / take a look / skip” — with reasoning, risks, and what data is missing. It doesn’t submit bids and it signs nothing; that’s a deliberate limit. The final call and the responsibility stay with the human, and the AI takes the first-pass grind off their plate, not the right to decide.
You’re not choosing whether to “adopt AI” — you’re choosing which mode a specific pairing runs in on a specific process. And the mode has to match the cost of error and the volume, not the fashion.
What to put into the pairing
Mode chosen — now the question is which task to put into it at all. Not every task belongs there, even if it’s technically possible.
Compare two cases. A draft reply to a client is easy to check; the mistake is usually visible and fixable. A legal clause is hard to check, and the mistake might surface months later and cost a fortune. These are different classes of task, even though both can be sent to an AI. Which leaves you with two questions: how expensive is the task to do compared with how expensive it is to check, and is a mistake visible when it happens.
First criterion: hand off tasks where checking is cheap and doing is expensive. If a human can size up the AI’s output at a glance but building it from scratch takes a while — the pairing wins. But if confirming the answer means effectively redoing the work, the pairing doesn’t speed things up, it slows them down. That’s exactly the situation where the gain gets eaten by review.
Second criterion: is the mistake visible. Cheap checking combined with an invisible, expensive mistake is a trap: the person thinks they checked it, but in fact they only glanced at it and missed the mistake.
| Cost of checking | Cost of error | What to do |
|---|---|---|
| Cheap | Low | Good candidate to start with |
| Cheap | High | Possible, but with tight controls and 100% review |
| Expensive | Low | Usually doesn’t pay off |
| Expensive | High | Don’t take it on at the start |
A sign you’ve picked the wrong task: people say “it’s easier to do it myself than to check after it.” That’s usually not laziness, it’s the truth — a signal that either the task is wrong or the mode is wrong.
From our own work. A platform for as-built construction documentation (our PTO pilot): the AI prepares the certificates and registers, finds and matches quality documents, checks completeness. The engineer reviews and signs. The cost of error is clear and visible: the construction-control inspector will send an incomplete package back for rework, the site stalls, payment slips. So here the human review isn’t a formality — it’s the very part of the work the whole thing was set up for.
A side note: tasks where the mistake is both expensive and invisible (final reporting, legal wording, medical decisions) are best kept out of the pairing entirely at the start. Not because AI can’t handle them, but because the cost of a missed mistake outweighs all the savings.
Where review breaks down
Say the mode is right and the task is a good fit. Human review still breaks down — not because people are bad, but for plain mechanical reasons. That department from the opening didn’t break because of laziness. Let me go through the points where I see it break most often.
Attention dulls. If the AI is almost always right, the human stops reading and starts mechanically approving. This is the “approve”-button operator. The diagnostic sign is simple: the edit rate trends toward zero. A solid month of everything getting approved without a single change almost never means the AI is perfect. It means review is dead.
The interface nudges you to approve. Review quality depends heavily on how the screen is built. If “approve” is one button, but fixing something means rewriting the whole thing by hand, people will approve. That’s not weak character, it’s normal behavior in a badly designed system. What helps: show the source next to the AI’s answer; give a fast way to amend rather than rewrite from scratch; show exactly what the human changed; make rejecting the AI’s answer an ordinary action rather than a feat of heroism; and don’t bury “I’m not sure” in tiny gray type.
Calibration is off. A person needs to know where the AI is strong and where it routinely lies. Both extremes hurt. Too little trust — the human re-verifies everything from scratch, including what the AI does reliably, and the pairing loses its point. Too much — the human believes the AI even where it’s wrong, especially under pressure and in a hurry. And a hurry under load is exactly when mistakes are most likely.
The reviewer is overloaded. The AI produces five times as much, and the same person has to review it at the old pace. First they try, then they review superficially, then they burn out. You can’t crank up the input flow without recalculating how much a human can physically review on the output.
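As a rough illustration of that recalculation, here is a minimal Python sketch. The function names and every number in it (240 review minutes a day, 6 minutes per item, 30 items a day before AI) are assumptions made up for the example, not measurements.

```python
def max_reviewable_per_day(review_minutes_per_day, minutes_per_item):
    """How many items one reviewer can genuinely check in a day."""
    if minutes_per_item <= 0:
        raise ValueError("minutes_per_item must be positive")
    return int(review_minutes_per_day // minutes_per_item)


def unreviewed_per_day(items_in, capacity):
    """Items that arrive each day but cannot get a real review."""
    return max(0, items_in - capacity)


# Hypothetical numbers, for illustration only.
capacity = max_reviewable_per_day(review_minutes_per_day=240, minutes_per_item=6)
old_flow = 30            # items per day before the AI
new_flow = old_flow * 5  # the AI produces five times as much

print(capacity)                                # 40
print(unreviewed_per_day(old_flow, capacity))  # 0
print(unreviewed_per_day(new_flow, capacity))  # 110
```

With these assumed numbers, 110 items a day either pile up or get approved without a real look, which is the overload described above.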
The reviewer has no authority. The most underrated failure, and it’s not about the person — it’s about the organization. The human sees the problem but can do nothing: their KPI demands closing a hundred items a day, there’s no “send back to the AI for rework” button, and their manager scolds them for delays but not for missed mistakes. In a system like that, the person quickly figures out that approving is the smart move. The reviewer needs the right to stop a result, send it back for rework, request a source, or reroute the task. If they’re accountable for the mistake but can’t stop the release, that’s not oversight — that’s a designated fall guy.
The sum of all these failures is the illusion of productivity. AI lets you produce a lot, fast. But ten unreviewed drafts aren’t more work done — they’re more raw material to process. Speed of generation without speed of review gives you not productivity but a pile-up. That same gap: “feels like a lot, but barely anything reached a finished state.”
And the crucial part: all of these failures happen invisibly. From the outside, everything looks like it’s working — right up until the first mistake flies out. To see the failure before it reaches a client, you have to make review visible.
If you can’t see the review, it isn’t there
This is where the whole article turns.
In a pairing with AI, a person’s work is less and less about “doing” and more and more about “checking and signing off.” The AI takes over generation; the human’s value and responsibility shift to review. And if that review is never recorded, then from the company’s point of view it’s as if it never happened — exactly like the shadow use we started with.
Review is work, not “well, he glanced at it.” It has to be made visible and quantifiable, like any other work. That means agreeing: what exactly we’re checking (by what criteria the result passes), how deeply (a draft internal email gets a quick look, a contract gets read line by line — different protocols), and whether it’s full or sampled (at scale, sampled review plus full review of everything the system flagged as doubtful is often sensible). If you can’t say how a task should be checked, it’s too early to hand it to a pairing.
The system also needs rules for when a result can’t be released automatically: low model confidence, conflicting sources, missing data, an expensive mistake. Don’t rely on “the AI will honestly say it’s unsure” (nothing obliges it to); rely on defined thresholds and escalation rules that route the result to a human.
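One way such rules might look in code is sketched below. This is an illustration under assumptions, not a reference design: the `AiResult` fields, the 0.8 confidence threshold, and the 25% sample rate are all hypothetical, and it assumes the system exposes some confidence score at all. It combines the two ideas above: full review of everything that trips a rule, plus a random sample of the rest.

```python
import math
import random
from dataclasses import dataclass

MIN_CONFIDENCE = 0.8  # hypothetical threshold; tune it on your own data


@dataclass
class AiResult:
    item_id: str
    confidence: float      # assumed: the system exposes a score in [0, 1]
    sources_conflict: bool
    missing_fields: int
    high_error_cost: bool


def escalation_reasons(result):
    """Return every rule that blocks automatic release (empty list = none)."""
    reasons = []
    if result.confidence < MIN_CONFIDENCE:
        reasons.append("low confidence")
    if result.sources_conflict:
        reasons.append("conflicting sources")
    if result.missing_fields > 0:
        reasons.append("missing data")
    if result.high_error_cost:
        reasons.append("expensive mistake")
    return reasons


def pick_for_review(results, sample_rate, rng):
    """Everything flagged goes to a human, plus a random sample of the rest."""
    flagged = [r for r in results if escalation_reasons(r)]
    clean = [r for r in results if not escalation_reasons(r)]
    sample_size = min(len(clean), math.ceil(len(clean) * sample_rate))
    return flagged, rng.sample(clean, sample_size)


results = [AiResult(f"doc-{i}", 0.95, False, 0, False) for i in range(8)]
results.append(AiResult("doc-8", 0.55, False, 0, False))  # low confidence
results.append(AiResult("doc-9", 0.97, True, 2, True))    # three rules fire

flagged, sampled = pick_for_review(results, sample_rate=0.25, rng=random.Random(0))
print([r.item_id for r in flagged])  # ['doc-8', 'doc-9']
print(len(sampled))                  # 2
```

Returning the reasons rather than a bare yes/no means the reviewer sees why an item was routed to them, and the same reasons can later be written into the trail.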
The digital trail is a record of how a result came to be. In plain terms: what the AI produced, what the human changed, who put the final sign-off, and when. It doesn’t have to be a system with hashes and immutable logs right away (in regulated industries it’ll get there); at the start, an honest log of who did what is enough.
Why a company wants this:
- Accountability becomes real. There’s a specific person who accepted the result. Not “the team,” not “the AI,” but Ivanov on such-and-such date. This cures diffuse responsibility, where you can’t find who’s to blame.
- Post-mortems become possible. When a mistake gets out, you can see at which step. Did the AI output it that way and the human approve without looking? Or did they edit it, but badly? Without a trail it’s guesswork; with one it’s a concrete conclusion and a concrete fix.
- The button operator is visible. From the trail you can immediately see whether the human is actually reviewing: the share and depth of edits. Collapsed to zero — the pairing has become a fiction, time to step in. Without a trail you’ll only notice via a mistake that flew out the door.
- The company is protected. In front of a client, in a dispute, before a regulator, you can show that human oversight wasn’t just talk. In Europe this is moving from “good practice” toward a requirement: for high-risk AI systems, the law already explicitly demands both logging and human oversight (the EU AI Act; the application dates depend on the type of system, so it’s best not to pin it to a single date). Even if it doesn’t apply to you directly, the direction is clear.
A sobering detail: 2026 estimates suggest fewer than half of corporate AI agents are actively monitored and protected. That doesn’t mean everyone else has no logs at all, but the scale of the visibility problem is clear. Whoever gets this in order first wins an edge — not on hype, but on real oversight.
From our own work. In Torgi, the agent’s decisions and the human’s responses are written to a shared store, and the company profile (which the agent uses to pick tenders) changes only through a chain: signal → proposal → show the difference → human confirmation → apply → record. The agent doesn’t edit the rules itself; every change has an author and a history. That’s a digital trail in working form: you can see who decided what, and why.
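To make the chain concrete, here is a minimal Python sketch of the same idea: propose, show the difference, require a named human, apply, record. It is not Torgi’s code; the function names, the profile field `max_contract_value`, and the in-memory `history` list are assumptions chosen for illustration.

```python
import difflib
import json
from datetime import datetime, timezone


def show_difference(current, proposed):
    """Render a line-by-line diff of two profile versions."""
    before = json.dumps(current, indent=2, sort_keys=True).splitlines()
    after = json.dumps(proposed, indent=2, sort_keys=True).splitlines()
    return "\n".join(
        difflib.unified_diff(before, after, "current", "proposed", lineterm="")
    )


def apply_change(profile, proposed, proposed_by, confirmed_by, history):
    """Apply a proposal only with a named human confirmer, and record it."""
    if not confirmed_by:
        raise PermissionError("a human must confirm the change")
    history.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "proposed_by": proposed_by,
        "confirmed_by": confirmed_by,
        "diff": show_difference(profile, proposed),
    })
    return dict(proposed)


# Hypothetical profile field, for illustration only.
profile = {"max_contract_value": 1000000}
proposed = {"max_contract_value": 2000000}
history = []

print(show_difference(profile, proposed))  # what the human sees before confirming
new_profile = apply_change(profile, proposed, "agent", "Ivanov", history)
print(new_profile, len(history))
```

The point of the design is that the agent can only produce `proposed`; the change itself goes through `apply_change`, which refuses to run without a confirmer and always leaves a record.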
Don’t overdo it. A trail isn’t an excuse to wrap people in surveillance and protocols across three systems. Do that and people will either stop using the AI or start cutting corners on the records. The principle: depth of trail proportional to cost of error. Where it’s cheap — the minimum; where it’s expensive — in detail. Better a simple, working log than a beautiful system everyone sabotages.
What happens to people’s skills
A trail isn’t only about accountability. It also shows what to teach people and where they’re losing their grip on the work. And that’s a separate problem, noticed late, because it builds up over years.
If the AI always makes the first pass, within a year an experienced employee starts forgetting how to do it themselves. This isn’t theory anymore: skill erosion from over-reliance on AI is named among the top risks in 2025–2026 research. For a business this is a concrete loss: expertise that lived in people walks out the door, and there’s no one left to check the AI on the non-standard cases.
For juniors it’s worse. The experienced person at least forgets — they had the skill. A newcomer who starts straight from the AI’s finished first pass never builds the skill at all. They don’t go through the normal trajectory of mistakes, they don’t learn to tell good output from bad, because they’ve never seen a result assembled by hand. A couple of years later, you’ve got a person who can click “approve” but can’t understand what they’re approving.
The remedy is unpleasant but it works: deliberately keep part of the work in manual mode, debrief mistakes, teach from the edits, pair the newcomer with a mentor and not just with the AI. And use that same digital trail not to punish but to teach: the saved edits show what most often goes wrong and where the human is losing the thread.
What changes for management
Put it all together and what changes isn’t the amount of a person’s work, but its content. And the old ways of measuring and loading people break.
The role shifts from doer to editor and decision-maker. The person produces less and checks, edits, and handles more judgment calls. That’s a different kind of work and a different kind of fatigue: an hour of dense review with constant decision-making is more draining than an hour of familiar production. Fail to account for it in workload and people burn out, and you won’t understand why. And honestly: this isn’t always an upgrade. A well-designed pairing makes the work more substantive; a badly designed one turns a specialist into an expensive buffer between the AI and the “send” button.
The old output norms lie. “A manager handles N requests a day” describes a world where they did them by hand. In a pairing you have to redefine what counts as a unit of their work and how much they can actually review without sliding into the “approve” button.
Measure the outcome, not the activity. Not “how many answers the AI generated” and not “how many employees have access,” but how much work passed through the pairing and produced a usable result: how much reached the client error-free, how much was accepted with edits, how much came back for rework, how many disputed cases went to a human.
You need an owner of the result and an owner of the metrics. One person accountable for the pairing’s output, one person making sure the measurements get collected. Without these two roles, everything blurs.
And a question to answer honestly before you launch: where will the freed-up time go. There are essentially four options — more of the same work, new work, higher quality, or a rethink of the staffing model. With no decision, the savings simply dissolve into paid idle time.
Where to start trying
- Take one process where checking is cheap, doing is expensive, and the mistake is visible. Not the most important one — the most clear-cut and repeatable.
- Choose the mode deliberately. Almost always start with Mode 2: AI does it, human reviews everything. Autopilot comes later, once quality is proven.
- Define what a correct result is and how to check it. Can’t? Do that first.
- Appoint an owner of the result and an owner of the metrics, and give the reviewer the right to stop a release. Can be one person, but make it explicit.
- Set up a simple digital trail from day one. Even a spreadsheet: what the AI output, what got edited, who accepted it.
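If a spreadsheet is the starting point, the trail from the last step could be as small as the following Python sketch, which writes one CSV row per accepted result. The column names and the sample rows are assumptions for illustration; the only requirement taken from the text is that each row records what the AI output, what was finally accepted, who accepted it, and when.

```python
import csv
import io
from datetime import datetime, timezone

FIELDS = ["item_id", "ai_output", "final_text", "edited", "accepted_by", "accepted_at"]


def trail_row(item_id, ai_output, final_text, accepted_by, accepted_at=None):
    """One line of the trail: AI output, accepted text, who signed off, when."""
    when = accepted_at or datetime.now(timezone.utc)
    return {
        "item_id": item_id,
        "ai_output": ai_output,
        "final_text": final_text,
        "edited": "yes" if final_text != ai_output else "no",
        "accepted_by": accepted_by,
        "accepted_at": when.isoformat(),
    }


def write_trail(rows, out):
    """Write the rows as CSV to any text file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample rows.
rows = [
    trail_row("req-1", "Delivery takes 5 days.", "Delivery takes 5 working days.", "Ivanov"),
    trail_row("req-2", "Invoice attached.", "Invoice attached.", "Ivanov"),
]
buffer = io.StringIO()
write_trail(rows, buffer)
print(buffer.getvalue())
```

Keeping both the AI output and the accepted text in the same row is what makes the later measurements possible without any extra tooling.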
What to measure (the minimum, all of it flowing from the trail):
- Share and depth of edits. Zero — review is dead. Almost everything rewritten — the AI can’t handle the task.
- Time to review one unit versus time to do it from scratch. Close together — the pairing isn’t speeding things up; change the mode or the task.
- Output quality: how many errors reached the client compared to before AI.
- Real usage: what share of the flow went through the pairing — not what percentage of people “have access.”
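The first two measurements can be computed directly from trail rows. The Python sketch below is one possible way to do it, under assumptions: edit depth is approximated with the similarity ratio from the standard `difflib` module (a swapped-in proxy, not a method the article prescribes), and the sample texts and timings are made up.

```python
from difflib import SequenceMatcher


def edit_share(pairs):
    """Share of items where the accepted text differs from the AI output."""
    if not pairs:
        return 0.0
    return sum(1 for ai, final in pairs if ai != final) / len(pairs)


def edit_depth(ai_output, final_text):
    """0.0 = accepted unchanged, 1.0 = nothing in common (difflib similarity)."""
    return 1.0 - SequenceMatcher(None, ai_output, final_text).ratio()


def review_to_do_ratio(review_minutes, do_minutes):
    """Close to 1.0 means checking costs about as much as doing it by hand."""
    return review_minutes / do_minutes


# (AI output, accepted text) pairs taken from a hypothetical trail.
pairs = [
    ("Invoice attached.", "Invoice attached."),
    ("Delivery takes 5 days.", "Delivery takes 5 working days."),
]
print(edit_share(pairs))          # 0.5
print(edit_depth(*pairs[0]))      # 0.0
print(review_to_do_ratio(5, 20))  # 0.25
```

Tracked week over week, an edit share drifting toward zero is the early warning of rubber-stamping, and a ratio creeping toward 1.0 says the pairing has stopped saving time.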
Three months of measuring like this gives an honest answer about whether the pairing works or needs rebuilding. And unlike a pretty deck, the answer is yours, on your data.
Instead of a conclusion
Back to the department from the opening, where oversight quietly died. There was no ill intent anywhere: a person was put in the loop, the work flowed, everyone was busy. Just one thing was missing — the conditions under which review could work at all.
There are essentially four of them. Time to review, not rubber-stamp. An interface where fixing is as easy as approving. The authority to stop a release rather than silently wave it through. And a trail that shows the review actually happened.
Remove even one and the human in the loop stops being a control. They become an alibi. And they cost as much as a control.
