For most of the last two years, an AI coding session was measured in minutes. You prompted, the model produced a chunk of code, you reviewed it, you prompted again. The human stayed in the loop on every step. Anthropic's 2026 Agentic Coding Trends Report documents something categorically different: agent sessions that stretch from minutes into hours, including one change that swept across a 12.5-million-line codebase in a single, uninterrupted seven-hour run. This is long-horizon autonomy — the agent works, the human sleeps, and the review happens on the result rather than on each step. It is genuinely powerful. It also breaks the review model that kept AI-assisted coding honest.
From Minutes to Hours: A Phase Change
The jump from minute-scale to hour-scale agent runs is not a gradual improvement; it is a phase change in how software gets built. A minute-scale agent is a faster autocomplete — the human remains the executive function, deciding what happens next and catching mistakes as they appear. An hour-scale agent is a delegate. You hand it a goal, it makes thousands of decisions on its own, and you do not see any of them until it is done. The unit of human attention shifts from the keystroke to the outcome.
A 12.5-million-line codebase is not a toy. It is the scale of a mature enterprise monorepo: thousands of interdependent modules, decades of accumulated decisions, and the kind of cross-cutting coupling that turns any change into careful surgery. That an agent can sustain a coherent change across that surface for seven hours is a real capability milestone. The same report is also the source of the headline that Claude writes 80% of Anthropic's own production code. Long-running autonomy is not a demo; it is how a frontier lab already ships.
"Sessions now stretch from minutes to hours, with agents executing long-horizon changes autonomously — the human reviews the result rather than supervising each step."
What Makes Seven Hours Possible
A seven-hour run is not one long prompt. It is a stack of infrastructure capabilities that, together, let an agent stay coherent and on-task far past the point where earlier systems would drift, lose context, or corrupt their own working state. Each piece solves a specific failure that used to cap session length.
1M-token windows + compaction
- A million-token window holds large swaths of a real codebase at once
- Compaction summarizes earlier work so the window never fills and stalls
- The agent keeps a working picture without re-reading everything each step
Persistent state across the run
- Decisions and findings persist outside the context window
- The agent recalls what it already tried and ruled out
- Long-horizon goals survive compaction and tool cycles
Git worktrees + verification loops
- Worktrees give the agent a sandbox that cannot break main
- Tests and builds run in the loop, not just at the end
- The agent catches its own regressions before they compound
The verification loop is the load-bearing piece. An agent that only writes code drifts further from correct with every step, because errors compound silently. An agent that writes code, runs the tests, reads the failures, and corrects — over and over — has a feedback signal that pulls it back toward correctness. This is why the codebases where long-running agents work best are the ones with strong, fast, comprehensive test suites. The eval gate is what lets the human leave the room. Without it, a seven-hour run is seven hours of unsupervised drift.
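The write, test, read, correct cycle described above can be sketched in a few lines. This is a minimal illustration, not how any particular agent is implemented: `verification_loop`, `CheckResult`, and the toy `make_toy` stand-ins are hypothetical names, and a real system would replace the toy callables with an actual edit step and an actual test or build runner.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CheckResult:
    passed: bool
    failures: List[str]


def verification_loop(
    propose_edit: Callable[[List[str]], None],
    run_checks: Callable[[], CheckResult],
    max_attempts: int = 5,
) -> bool:
    """Edit, run the checks, feed failures back; stop when green or out of budget."""
    failures: List[str] = []
    for _ in range(max_attempts):
        propose_edit(failures)      # the agent edits, seeing the previous failures
        result = run_checks()       # e.g. the test suite and the build
        if result.passed:
            return True
        failures = result.failures  # the feedback signal for the next attempt
    return False


def make_toy(bugs: int):
    """Hypothetical stand-in: each edit fixes one bug, checks pass at zero bugs."""
    state = {"bugs": bugs}

    def propose_edit(failures: List[str]) -> None:
        if state["bugs"] > 0:
            state["bugs"] -= 1

    def run_checks() -> CheckResult:
        remaining = state["bugs"]
        return CheckResult(remaining == 0, [f"bug {i}" for i in range(remaining)])

    return propose_edit, run_checks


converged = verification_loop(*make_toy(3), max_attempts=5)  # reaches green
gave_up = verification_loop(*make_toy(3), max_attempts=2)    # budget runs out first
```

The attempt budget matters as much as the loop: a run that cannot reach green should stop and report rather than keep editing.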
Why Context Compaction Is the Quiet Hero
Of all the enabling pieces, context compaction is the one most people underestimate. A naive long run hits a wall the moment its transcript outgrows the window: either it truncates and forgets what it was doing, or it stalls. Compaction breaks that ceiling by continuously distilling the run's history into a compact working summary — what the goal is, what has been tried, what the current state of the code is — so the agent can keep going for hours without ever re-reading the entire transcript. The agent operates on a rolling, curated picture of its own work rather than a bottomless log.
Compaction is also where long runs quietly go wrong, which is why it deserves attention rather than trust. A summary is lossy by definition. If the compaction step drops a constraint the agent committed to in hour one — a directory it promised not to touch, a pattern it agreed to follow — the agent can drift in hour five without ever knowing it violated an earlier decision, because that decision is no longer in its working memory. The best long-running systems treat hard constraints as persistent, pinned facts that compaction is forbidden to summarize away, separate from the soft, summarizable history of what happened.
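One way to picture the split between pinned constraints and summarizable history is the sketch below. It is an illustration under assumptions, not a description of any vendor's compaction mechanism: `WorkingMemory`, `compact`, and the lambda summarizer are hypothetical, and a real summarizer would be a model call rather than a string template.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class WorkingMemory:
    pinned: List[str] = field(default_factory=list)   # hard constraints, never summarized
    history: List[str] = field(default_factory=list)  # soft log, safe to summarize

    def compact(self, summarize: Callable[[List[str]], str], keep_recent: int = 2) -> None:
        """Collapse older history into one summary entry; pinned facts are untouched."""
        cut = len(self.history) - keep_recent
        if cut <= 0:
            return
        older, recent = self.history[:cut], self.history[cut:]
        self.history = [summarize(older)] + recent

    def render(self) -> str:
        """Build the text the agent sees: constraints first, then the rolling history."""
        return "\n".join(["CONSTRAINTS:"] + self.pinned + ["HISTORY:"] + self.history)


memory = WorkingMemory(pinned=["Do not modify files under vendor/"])
memory.history = ["step 1", "step 2", "step 3", "step 4"]
memory.compact(lambda entries: f"summary of {len(entries)} earlier steps")
```

Because `compact` only ever rewrites `history`, a constraint recorded in hour one is still rendered verbatim in hour five, however many times the log has been summarized.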
The New Review Problem
Here is the uncomfortable trade. The whole value of an overnight agent is that you did not have to watch it work. But code review was never just about the final diff — it was about watching the reasoning unfold, asking "why did you do it that way" at the moment the decision was made, and catching a wrong turn before it propagated. When you review a giant diff you never watched get created, you have lost all of that context. You are doing archaeology on a change whose intent you have to reconstruct from the artifact alone.
Against a 12.5-million-line codebase, this is less a code review than a needle-in-a-haystack search. The agent's output will be mostly correct, which is exactly what makes it dangerous. A diff that is 90% obviously fine lulls the reviewer into skimming, and the 10% that contains the subtle, plausible-looking error sails through. We have written about how this "almost right" problem is developers' single biggest frustration with AI code. A long-running agent does not remove that frustration; it scales it up to the size of an overnight diff and removes the step-by-step visibility that used to help you catch it.
There is also a pure economics dimension. A minute-scale agent that goes wrong costs you a minute and a re-prompt. A seven-hour agent that goes wrong on a flawed premise costs you seven hours of wall-clock time, a meaningful pile of tokens, and — worst of all — a giant diff that has to be untangled or thrown away entirely. The blast radius of a bad autonomous run scales with the run's length. Long-horizon autonomy raises both the upside and the cost of being wrong.
A Tale of Two Overnight Runs
Picture the same agent given two different tasks overnight. The first: migrate a 12-million-line codebase off a deprecated logging library to its replacement, updating every call site. The interface change is mechanical, the test suite exercises the affected paths, and correctness is checkable — either the tests pass and the build is green, or they do not. In the morning the human reviews a large but uniform diff against a passing test suite, spot-checks the few non-mechanical edge cases the agent flagged, and ships. This is overnight autonomy working as designed: the agent absorbed thousands of tedious, verifiable edits that a human would have hated doing by hand.
The second task: redesign the service's authorization model to support a new tenancy scheme. Here correctness is not a test result — it is a judgment about security boundaries, edge cases that no existing test covers, and consequences that only surface under adversarial conditions. The agent will produce something that compiles and passes the tests, because the tests were written for the old model. The seven-hour run yields a confident, plausible, and possibly dangerously wrong design that the human now has to evaluate from scratch — having gained nothing from the agent's work except a head start on a draft they cannot trust. Same agent, same infrastructure, opposite outcome, because the verifiability of the task was opposite.
Guardrails for Long-Horizon Autonomy
The answer is not to forbid long runs; it is to engineer them so that being wrong is cheap and catchable. Even the lab that ships this way knows that the model was never the hard part. As we argued in our analysis of why 88% of AI agents never reach production, durable execution and verification, not a smarter model, are what get agents shipped. Long-running agents make that doubly true. Four guardrails turn an overnight run from a gamble into an engineering practice: checkpoints, reversal criteria, eval gates, and a scoped blast radius.
Checkpoints are the most important. The agent should commit at every milestone where the tests pass, so a seven-hour run is really a chain of verified, recoverable states rather than one all-or-nothing leap. If hour six goes wrong, you roll back to the hour-five checkpoint, not to zero. Reversal criteria are the conditions, defined before the run starts, under which the agent should stop and signal a human rather than push forward — a test suite that drops below a threshold, a change that touches a forbidden directory, a repeated failure to make progress.
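A small sketch of how checkpoints and reversal criteria might be represented follows. Everything here is hypothetical: the class names, the 0.95 pass-rate threshold, and the stall limit of three steps are illustrative assumptions, and in a real run each checkpoint label would be a commit in the agent's worktree rather than a plain string.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RunStatus:
    pass_rate: float               # fraction of tests passing, 0.0 to 1.0
    touched_paths: List[str]
    steps_without_progress: int


@dataclass
class ReversalCriteria:
    """Stop conditions fixed before the run starts (values are assumptions)."""
    min_pass_rate: float = 0.95
    forbidden_prefixes: List[str] = field(default_factory=list)
    max_stalled_steps: int = 3

    def violation(self, status: RunStatus) -> Optional[str]:
        """Return the reason to stop and signal a human, or None to continue."""
        if status.pass_rate < self.min_pass_rate:
            return "pass rate below threshold"
        for path in status.touched_paths:
            if any(path.startswith(p) for p in self.forbidden_prefixes):
                return f"touched forbidden path: {path}"
        if status.steps_without_progress >= self.max_stalled_steps:
            return "no progress"
        return None


@dataclass
class CheckpointChain:
    checkpoints: List[str] = field(default_factory=list)

    def record(self, label: str, tests_green: bool) -> bool:
        """Only verified states become checkpoints."""
        if tests_green:
            self.checkpoints.append(label)
        return tests_green

    def rollback_target(self) -> Optional[str]:
        return self.checkpoints[-1] if self.checkpoints else None


chain = CheckpointChain()
chain.record("hour-4", True)
chain.record("hour-5", True)
chain.record("hour-6", False)          # tests red: not a checkpoint
target = chain.rollback_target()       # the last verified state

criteria = ReversalCriteria(forbidden_prefixes=["services/payments/"])
reason = criteria.violation(RunStatus(0.99, ["services/payments/charge.py"], 0))
clean = criteria.violation(RunStatus(0.99, ["services/logging/writer.py"], 0))
```

Here a failed hour six leaves `target` at the hour-five state, and touching the payments directory produces a stop reason even though the tests are nearly all green.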
Eval gates are what the agent must satisfy to consider any step done: not "the code compiles" but "the relevant tests pass, coverage held, and the lint and type checks are clean." The stronger the gate, the more you can trust the unsupervised work between checkpoints. Scoped blast radius means the agent's worktree and permissions are bounded to the part of the system the task actually concerns: it cannot touch the payment service while refactoring the logging layer. A bounded scope turns "review a diff spread across a 12-million-line codebase" back into "review the change to these modules."
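Both guardrails reduce to checks that are easy to state in code. The sketch below is illustrative only: `StepEvidence`, `gate_failures`, and `out_of_scope` are hypothetical names, the directory names are made up, and paths are assumed to be normalized and repository-relative.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Iterable, List


@dataclass(frozen=True)
class StepEvidence:
    tests_passed: bool
    coverage: float
    baseline_coverage: float
    lint_clean: bool
    types_clean: bool


def gate_failures(e: StepEvidence) -> List[str]:
    """A step counts as done only if this list comes back empty."""
    failures = []
    if not e.tests_passed:
        failures.append("tests failing")
    if e.coverage < e.baseline_coverage:
        failures.append("coverage dropped")
    if not e.lint_clean:
        failures.append("lint errors")
    if not e.types_clean:
        failures.append("type errors")
    return failures


def out_of_scope(changed: Iterable[str], allowed_roots: Iterable[str]) -> List[str]:
    """Return changed paths that fall outside every allowed directory."""
    roots = [PurePosixPath(r) for r in allowed_roots]
    escaped = []
    for path in changed:
        p = PurePosixPath(path)
        # Compare by path components so "services/logging_v2" is not
        # mistaken for a child of "services/logging".
        if not any(root == p or root in p.parents for root in roots):
            escaped.append(path)
    return escaped


failed = gate_failures(StepEvidence(True, 0.81, 0.83, True, True))
escaped = out_of_scope(
    ["services/logging/writer.py", "services/payments/charge.py", "services/logging_v2/x.py"],
    ["services/logging"],
)
```

Matching on path components rather than string prefixes is a deliberate choice: a prefix test would quietly admit a sibling directory whose name merely starts the same way.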
When Overnight Autonomy Is Worth It — and When It Is Not
The decisive variable is verifiability. Long-running autonomy shines on work where correctness can be checked mechanically and the change is reversible: a large-scale mechanical refactor, a framework migration with strong test coverage, a dependency upgrade that ripples through thousands of call sites, the kind of tedious, high-volume change that is easy to specify and easy to verify but enormously time-consuming to do by hand. These are exactly the tasks where a human watching each step adds nothing but fatigue, and where a strong eval gate is a faithful substitute for human supervision.
The danger zone is work where correctness depends on judgment no eval can encode: a security-sensitive change, an architecture decision with long-term consequences, anything touching a system whose failure modes are not captured by tests. Here the verification loop has nothing reliable to push against, and a seven-hour run becomes seven hours of confident drift toward a plausible-but-wrong answer. The rule of thumb is simple: if you cannot write the check that proves the agent succeeded, you cannot safely leave the room while it works.
"An overnight agent is only as trustworthy as the test that greets it in the morning. If you cannot encode success, you have not delegated the work — you have just postponed the supervision."
How Long Runs Reshape the Team's Day
The second-order effect of overnight autonomy is on the rhythm of engineering work, and it is larger than the raw productivity numbers suggest. When an agent can carry a large, well-specified change through the night, the human's job shifts from doing the change to two bookend activities: specifying it precisely enough that an unsupervised agent can succeed, and reviewing the result rigorously enough that a subtle error cannot hide. The middle — the hours of typing — is gone. What remains is harder, more cognitively demanding work at both ends.
Specification becomes a first-class skill. An agent that will run for seven hours without a human to course-correct needs a brief that anticipates the ambiguities a human collaborator would have asked about. Vague specs that worked fine for minute-scale, interactive sessions — where you could clarify on the fly — fail expensively at hour scale, because the agent commits to an interpretation early and builds seven hours of work on top of it. The discipline of writing a spec that is complete, unambiguous, and verifiable is exactly the discipline that long-running agents reward and sloppy prompting punishes.
On the review side, the team has to build new muscle. Reviewing a large diff you did not watch get created is a skill, not a chore — it means reading for intent you have to reconstruct, probing the 10% of the change most likely to hide a plausible error, and trusting the eval gates only as far as the gates actually cover. The teams that thrive will invest in review tooling — diff summarization, risk highlighting, targeted test generation for the changed surface — as deliberately as they invest in the agents themselves. The agent and the review apparatus are two halves of one system; shipping the first without the second is how the overnight win turns into a morning incident.
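As a rough idea of what risk highlighting could look like, the sketch below orders changed files so the reviewer reads the riskiest first. The heuristics are assumptions chosen for illustration (the keyword list, the weights, and the 200-line threshold are all made up), and real tooling would draw on coverage data and ownership metadata rather than path keywords.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChangedFile:
    path: str
    lines_changed: int
    covered_by_tests: bool


# Hypothetical keywords that mark a path as sensitive.
SENSITIVE_HINTS = ("auth", "payment", "crypto", "migration")


def risk_score(f: ChangedFile) -> int:
    """Higher means review sooner; the weights are illustrative, not calibrated."""
    score = 0
    if any(hint in f.path.lower() for hint in SENSITIVE_HINTS):
        score += 3
    if not f.covered_by_tests:
        score += 2   # the eval gate says nothing about untested code
    if f.lines_changed > 200:
        score += 1
    return score


def review_order(files: List[ChangedFile]) -> List[ChangedFile]:
    """Sort by descending risk, breaking ties by path for a stable order."""
    return sorted(files, key=lambda f: (-risk_score(f), f.path))


files = [
    ChangedFile("services/logging/writer.py", 40, True),
    ChangedFile("services/auth/session.py", 12, False),
    ChangedFile("tools/gen.py", 900, False),
]
ordered_paths = [f.path for f in review_order(files)]
```

The small, untested change under the auth directory sorts ahead of the 900-line generated file, which is the point: size is a weak signal compared with sensitivity and missing test coverage.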
Conclusion: The Reviewer Becomes the Bottleneck
Long-running agents move the bottleneck. For two years it was the model's capability: could it sustain a coherent change at all? The 2026 report answers that: yes, for seven hours, across a 12.5-million-line codebase. The new bottleneck is the human's ability to trust and verify a large change they did not watch get made. That is not a model problem; it is a workflow problem, and it is solved with checkpoints, eval gates, scoped blast radius, and the discipline to delegate only what you can verify.
The teams that win with overnight autonomy will not be the ones with the longest-running agents. They will be the ones who built the verification infrastructure that makes a long run safe — and who are honest about which tasks belong to the agent and which still belong to a human who has to stay in the room.