DevTools

FOBI Editorial

The devtools AI stack is fragmenting before it even consolidates, and the real battle isn’t over models but over who controls the verification layer.
07.05.26

Multiple bets

Long-horizon autonomous engineering

As coding agents move beyond isolated code generation and bug fixing, the central challenge increasingly lies in sustained, multi-target software development, and evaluation is shifting from short-horizon defect repair to long-horizon feature implementation. Agents reportedly begin to degrade once a task exceeds roughly 35 minutes of equivalent human working time, a fundamental obstacle as they scale from short interactions to extended operations. Solving this would let agents complete multi-day engineering tasks autonomously, effectively unlocking a new tier of automation.

Approaches in flight


▸ Long-context reasoning models with execution loops

OpenAI introduced GPT-5-Codex in September 2025 as the first version of GPT-5 optimized for agentic coding; GPT-5.2, released in December 2025, marked the moment people began to believe autonomous coding agents could be reliable. In one run, Codex worked for about 25 hours uninterrupted, used about 13M tokens, and generated about 30k lines of code, performing well on the parts that matter for long-horizon work: following the spec, staying on task, running verification, and repairing failures as it went. Claude Opus 4.6 is claimed to reach 50% task completion at a 14.5-hour autonomous horizon, with no competing model publishing a comparable benchmark. However, METR's current Time Horizon 1.1 data estimates Claude Opus 4.6 at a 50% time horizon of roughly 11h59m and an 80% time horizon of roughly 1h10m, exposing the gap between marketing claims and reliable sustained operation.
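To make the distance between the 50% and 80% horizons concrete, here is a minimal sketch that assumes success probability follows a logistic curve in log2 of task length. That functional form is an assumption made for illustration, not something stated in the text, and the function names are hypothetical. Under that assumption the two quoted horizons pin down the slope of the curve, which then gives an estimated success rate at any task length.

```python
import math

def fit_slope(h50_min: float, h80_min: float) -> float:
    """Slope of a logistic curve in log2(task minutes) implied by the two horizons."""
    # At the 80% horizon the logit is ln(0.8 / 0.2) = ln 4.
    return math.log(4.0) / math.log2(h50_min / h80_min)

def success_probability(task_min: float, h50_min: float, slope: float) -> float:
    """Estimated success probability for a task of the given human-time length."""
    z = slope * (math.log2(h50_min) - math.log2(task_min))
    return 1.0 / (1.0 + math.exp(-z))

H50 = 11 * 60 + 59   # 50% horizon quoted above, in minutes
H80 = 1 * 60 + 10    # 80% horizon quoted above, in minutes
SLOPE = fit_slope(H50, H80)

if __name__ == "__main__":
    # 870 minutes is the 14.5-hour figure from the competing claim.
    for minutes in (35, H80, 240, H50, 870):
        p = success_probability(minutes, H50, SLOPE)
        print(f"{minutes:>4} min -> estimated success {p:.2f}")
```

The point of the sketch is the shape rather than the exact numbers: a curve this shallow means that a headline 50% horizon says little about how long an agent can run at a success rate a team would actually rely on.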

▸ Hierarchical planning and goal decomposition

Long-horizon planning has moved from a research aspiration to an engineering discipline in 2026. Architectural innovations include hierarchical goal decomposition, DAG-structured subgoal planning, tiered memory compression, reflective verification loops, and explicit goal-drift mitigations. The Planner-Worker model has emerged as the dominant architecture for long-running agents and has been adopted by leading systems. It addresses the "35-minute degradation problem" directly by breaking long tasks into chunks that fit within the effective performance window. EPAM's engineering analysis of long-horizon agents in production finds that agents with ambiguous or underspecified goals drift more quickly, which makes precise initial objectives and explicit success criteria non-negotiable for tasks beyond 20 minutes.
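A Planner-Worker split with DAG-structured subgoals can be sketched in a few lines. This is an illustrative sketch, not any vendor's implementation: the `Subgoal` fields, the `plan` and `run` helpers, and the 30-minute default budget are all assumptions chosen to reflect the constraints described above (chunks that fit the effective performance window, and an explicit success criterion per subgoal). Ordering uses the standard library's `graphlib.TopologicalSorter`.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable

@dataclass
class Subgoal:
    name: str
    success_criterion: str              # explicit, checkable definition of done
    est_minutes: int                    # planner's estimate of human-equivalent effort
    depends_on: list[str] = field(default_factory=list)

def plan(subgoals: list[Subgoal], budget_minutes: int = 30) -> list[str]:
    """Validate subgoals and return a dependency-respecting execution order."""
    names = {g.name for g in subgoals}
    for g in subgoals:
        if not g.success_criterion.strip():
            raise ValueError(f"{g.name}: missing success criterion")
        if g.est_minutes > budget_minutes:
            raise ValueError(f"{g.name}: exceeds {budget_minutes} min, decompose further")
        unknown = set(g.depends_on) - names
        if unknown:
            raise ValueError(f"{g.name}: unknown dependencies {sorted(unknown)}")
    graph = {g.name: set(g.depends_on) for g in subgoals}
    # static_order() yields each node after all of its predecessors and
    # raises graphlib.CycleError if the plan is not a DAG.
    return list(TopologicalSorter(graph).static_order())

def run(subgoals: list[Subgoal], worker: Callable[[Subgoal], bool]) -> dict[str, bool]:
    """Hand each subgoal to a worker in plan order; stop at the first failure."""
    by_name = {g.name: g for g in subgoals}
    results: dict[str, bool] = {}
    for name in plan(subgoals):
        results[name] = worker(by_name[name])
        if not results[name]:
            break
    return results

# Hypothetical example plan: each chunk stays under the budget.
GOALS = [
    Subgoal("api", "integration tests pass", 25, depends_on=["schema"]),
    Subgoal("schema", "migration applies cleanly", 15),
    Subgoal("docs", "endpoint reference builds", 10, depends_on=["api"]),
]
```

Rejecting oversized or criterion-free subgoals at planning time, rather than letting a worker discover the problem mid-task, is the design choice that keeps drift out of the execution phase.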

▸ Formal verification and deterministic output correction

A basic loop alternating LLM generation with Lean formal verification replicated all 9 successes on its own, with Lean catching hallucinated lemmas that informal proofs let through. Simple generate-and-verify patterns with formal tooling are production-ready for any domain with verifiable output structure. The cost floor for automated theorem proving has dropped from millions of dollars to hundreds, and the pattern transfers directly to code generation, data validation, compliance checking, and any other domain where output correctness can be checked automatically. The approach applies broadly across devtools wherever agents produce executable or verifiable artifacts.
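The generate-and-verify pattern itself is small. The sketch below is a hedged illustration rather than the system described above: a canned list of candidates stands in for the LLM, and a plain Python check stands in for Lean, so `make_canned_generator` and `verify_add` are hypothetical helpers. What carries over is the control flow: generate, verify deterministically, feed the verifier's message back, and stop on the first accepted candidate or when the attempt budget runs out.

```python
from typing import Callable, Optional

Verdict = tuple[bool, str]   # (accepted, message from the verifier)

def generate_and_verify(
    generate: Callable[[Optional[str]], str],
    verify: Callable[[str], Verdict],
    max_attempts: int = 5,
) -> tuple[Optional[str], int]:
    """Alternate generation and verification; return (accepted candidate, attempts used)."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        accepted, message = verify(candidate)
        if accepted:
            return candidate, attempt
        feedback = message          # the verifier's output drives the next attempt
    return None, max_attempts

def make_canned_generator(candidates: list[str]) -> Callable[[Optional[str]], str]:
    """Stand-in for a model call: returns pre-written candidates in order."""
    remaining = iter(candidates)
    def generate(feedback: Optional[str]) -> str:
        return next(remaining)
    return generate

def verify_add(source: str) -> Verdict:
    """Stand-in for a formal checker: run the candidate and test one property."""
    namespace: dict = {}
    try:
        exec(source, namespace)     # demo only; never exec untrusted code unsandboxed
        result = namespace["add"](2, 3)
    except Exception as exc:
        return False, f"error: {exc!r}"
    if result != 5:
        return False, f"add(2, 3) returned {result}, expected 5"
    return True, "ok"

if __name__ == "__main__":
    gen = make_canned_generator([
        "def add(a, b): return a - b",   # wrong: rejected by the verifier
        "def add(a, b): return a + b",   # right: accepted
    ])
    print(generate_and_verify(gen, verify_add))
```

Swapping `verify_add` for a call to a real checker (a proof assistant, a type checker, a test suite) changes nothing in the loop, which is why the pattern transfers across domains with automatically checkable output.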

▸ Dynamic context and state management

Getting agents to make consistent progress across multiple context windows remains an open problem in 2026: agents work in discrete sessions, and each new session begins with no memory of what came before. Trajectory sanitization before handoff filters the trajectory down to state-fact entries and removes speculative reasoning. Hard session length limits with explicit handoffs prevent unbounded context accumulation while preserving task continuity across sessions. Anthropic, OpenAI, and Cognition AI are competing on memory architectures that maintain coherent state without context explosion.
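Trajectory sanitization combined with a hard session limit might look like the following. This is a minimal sketch under stated assumptions: the `Entry` type, the `"state_fact"` and `"reasoning"` labels, and the 50-entry limit are hypothetical, and a real system would likely measure session length in tokens rather than entries.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    kind: str    # "state_fact" or "reasoning" (hypothetical labels)
    text: str

MAX_ENTRIES_PER_SESSION = 50   # assumed hard limit; tune per deployment

def sanitize(trajectory: list[Entry]) -> list[Entry]:
    """Keep only verified facts about task state; drop speculative reasoning."""
    return [e for e in trajectory if e.kind == "state_fact"]

def needs_handoff(trajectory: list[Entry], limit: int = MAX_ENTRIES_PER_SESSION) -> bool:
    """True once the session has reached its hard length limit."""
    return len(trajectory) >= limit

def record(trajectory: list[Entry], entry: Entry,
           limit: int = MAX_ENTRIES_PER_SESSION) -> list[Entry]:
    """Append an entry; on hitting the limit, start a new session from state facts only."""
    updated = trajectory + [entry]
    if needs_handoff(updated, limit):
        return sanitize(updated)   # the next session's starting context
    return updated

if __name__ == "__main__":
    session: list[Entry] = []
    session = record(session, Entry("state_fact", "tests in pkg/auth pass"), limit=3)
    session = record(session, Entry("reasoning", "maybe the cache is stale?"), limit=3)
    session = record(session, Entry("state_fact", "migration 0042 applied"), limit=3)
    print([e.text for e in session])
```

The handoff carries forward what is known to be true about the task and discards how the agent got there, so the next session inherits state without inheriting abandoned hypotheses.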
