Vocap

If you are a Vertical AI founder, Raphaëlle D'Ornano's latest piece on Cerebras is worth reading. Cerebras is an AI chip company that built the largest computer chip ever made, roughly the size of a dinner plate ("wafer-scale silicon"), designed to run AI models faster than anything else on the market. It went public this month at a $66 billion valuation.

D'Ornano's piece is ostensibly a structural bear case on that valuation. Her argument: Cerebras built a brilliant chip for the inference economy of 2023, but real-world AI today doesn't run one big model on one big chip. It strings together multiple smaller models, lots of tool calls, and a heavy dose of company-specific context, all coordinated by traditional CPUs. The binding latency no longer lives in the moment the AI model actually thinks. It lives in routing, tool calls, retrieval, policy enforcement, and state management. That's a structural problem for wafer-scale silicon. It's a structural opportunity for Vertical AI.

We've written before that the intelligence layer is commoditizing, and the execution layer is the prize. D'Ornano's piece is the closest thing to production telemetry we've seen confirming that thesis. Here's what it means for founders building in Vertical AI.

Orchestration Is Now Empirically the Value Layer

The most important data comes from Datadog's State of AI Engineering 2026, drawn from more than 1,000 production customers:

More than 70% of production organizations now run three or more models
69% of all input tokens are system prompts (tool guidance, policy definitions, internal instructions), not user input
Adoption of agent frameworks (LangGraph, LangChain, Vercel AI SDK, Pydantic AI) has roughly doubled in twelve months

The model itself is increasingly a commodity input. The work that turns a model call into reliable output sits in routing logic, tool integrations, retrieval, policy guardrails, and state management. That work is hard and domain-specific, and it accumulates value over time. It's exactly what a Vertical AI company is positioned to own.

The defensible layer in your product is not the LLM call. It's everything wrapped around the call. If your team is still framing your value as "we use AI to help customers do X faster," you're underselling what you actually do and giving foundation models too much credit for the work.

Multi-Model Is Now Table Stakes

The era of picking one AI provider and building your entire product on it is ending. The companies winning are running a portfolio of models the way a logistics company runs a portfolio of carriers, routing each job to the option that delivers it best.

Any pitch built around "we bet everything on one model provider" should be a yellow flag in your own product reviews and in our diligence. The architecture you ship needs to be model-agnostic from the start, with intelligent routing by cost, latency, and risk profile.

This is also a real gross-margin lever that most founders aren't working hard enough. Easy queries belong on cheaper, smaller models. Frontier inference should be reserved for genuinely hard cases. Companies that build sophisticated model portfolio management into the product see materially better unit economics. Every founder should know, right now, what percentage of inference cost goes to frontier models answering routine queries that a smaller model could have handled.
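As a rough illustration of routing by cost, latency, and risk profile, the Python sketch below picks the cheapest model in a portfolio that clears a task's difficulty tier and latency budget, then estimates what share of spend went to overkill. Everything in it is a hypothetical placeholder: the model names, prices, latencies, the three-tier difficulty scale, and the `route` and `overkill_share` helpers are assumptions for illustration, not any vendor's API or real pricing.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ModelOption:
    name: str                   # hypothetical label, not a real product
    cost_per_1k_tokens: float   # assumed USD price, placeholder only
    p95_latency_ms: int         # assumed latency, placeholder only
    max_difficulty: int         # hardest task tier (1-3) this model handles well


# Hypothetical portfolio; every number here is made up for illustration.
PORTFOLIO = [
    ModelOption("small", 0.0002, 300, 1),
    ModelOption("mid", 0.002, 800, 2),
    ModelOption("frontier", 0.02, 2500, 3),
]


def route(difficulty: int, latency_budget_ms: int, high_risk: bool = False) -> ModelOption:
    """Return the cheapest model that meets the difficulty tier and latency budget.

    High-risk tasks are bumped up one tier (capped at 3). If nothing fits the
    latency budget, fall back to the most capable model.
    """
    required = min(3, difficulty + 1) if high_risk else difficulty
    candidates = [
        m for m in PORTFOLIO
        if m.max_difficulty >= required and m.p95_latency_ms <= latency_budget_ms
    ]
    if not candidates:
        return max(PORTFOLIO, key=lambda m: m.max_difficulty)
    return min(candidates, key=lambda m: m.cost_per_1k_tokens)


def overkill_share(jobs: List[Tuple[int, int, ModelOption]]) -> float:
    """Fraction of total spend above what the cheapest adequate model would cost.

    Each job is (difficulty tier 1-3, tokens used, model actually used).
    """
    total = 0.0
    excess = 0.0
    for difficulty, tokens, used in jobs:
        spend = tokens / 1000 * used.cost_per_1k_tokens
        cheapest = min(
            (m for m in PORTFOLIO if m.max_difficulty >= difficulty),
            key=lambda m: m.cost_per_1k_tokens,
        )
        baseline = tokens / 1000 * cheapest.cost_per_1k_tokens
        total += spend
        excess += max(0.0, spend - baseline)
    return excess / total if total else 0.0
```

In practice the difficulty tier would come from a classifier or heuristic of your own; the point of the sketch is only that the routing decision and the overkill metric can both be made explicit and measured.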

The CPU Is Back, and So Is the Forgotten Engineering Work

For the last three years, the narrative has been that GPUs (the specialized chips that run AI models) are the only chips that matter. The latest research says that's wrong.

A Georgia Tech and Intel paper measuring end-to-end latency across five representative agentic workloads concluded that tool processing on CPUs eats a meaningful share of total latency, and that as GPUs get faster, the bottleneck shifts further toward the CPU, not away from it.

This reframes engineering priorities:

Efficient CPU-side coordination, caching, and tool execution matter more than maximizing GPU throughput
Companies running everything through the most expensive accelerator they can find are leaving margin on the table
Reliability engineering (multi-provider failover, capacity contracts, queue management) is a differentiator customers feel, given that capacity, not model intelligence, is what's actually breaking in production

It's not glamorous work, and it doesn't get talked about at AI conferences. But the gap between companies that take it seriously and companies that treat it as plumbing will show up in retention, NPS, and gross margins over the next four quarters.
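Multi-provider failover, one of the reliability items listed above, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: each provider is represented by a plain callable that takes a prompt and returns text, and `call_with_failover` and `AllProvidersFailed` are hypothetical names invented here, not part of any SDK. A production version would add timeouts, backoff, and per-error handling.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the list has exhausted its attempts."""


def call_with_failover(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
    attempts_per_provider: int = 2,
) -> Tuple[str, str]:
    """Try each provider in priority order and return (provider_name, response).

    `providers` is a list of (name, callable) pairs. The callables are
    stand-ins for whatever client code actually calls a model provider.
    """
    errors: List[str] = []
    for name, call in providers:
        for attempt in range(1, attempts_per_provider + 1):
            try:
                return name, call(prompt)
            except Exception as exc:  # a real system would catch narrower errors
                errors.append(f"{name} attempt {attempt}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

A caller would pass its providers in preference order, for example the primary contract first and a backup second, and log which provider actually served each request so that failover frequency becomes a tracked metric.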

Context Is a Moat and a Cost Problem at the Same Time

The average AI task is consuming twice as much input context as it did a year ago. For the most demanding users, it's four times as much. That's production teams stuffing more conversation history, more retrieved documents, more tool outputs, and more policy guardrails into every model call.

For Vertical AI companies sitting on proprietary domain context (clinical notes, deposition transcripts, underwriting files, demand forecasts, claims histories, supplier specs), this is a double-edged sword:

The moat is real. The depth of context you can pull into a workflow is something a horizontal platform cannot easily replicate.
The cost problem is also real. Per-token prices are coming down, but tokens per task are rising faster. It's like cell phone data getting cheaper while apps consume it twice as fast. The bill still goes up.

Every founder should be probing tokens-per-task trajectories, not just gross margin snapshots. If your tokens per workflow are climbing faster than your per-token cost is falling, the unit economics curve bends the wrong way regardless of what the P&L shows this quarter.
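The arithmetic behind that warning is simple enough to write down. Cost per task is the per-token price times tokens per task, so the year-over-year change is the product of the two trends. The Python sketch below is illustrative only: the function names are hypothetical, and the 30% price decline used in the note that follows is an assumed figure, not one taken from the research cited here.

```python
def cost_per_task_multiplier(token_growth: float, price_decline: float) -> float:
    """Year-over-year multiplier on cost per task.

    token_growth: 2.0 means tokens per task doubled.
    price_decline: 0.30 means the per-token price fell 30%.
    A result above 1.0 means each task costs more than it did a year ago.
    """
    return token_growth * (1.0 - price_decline)


def breakeven_price_decline(token_growth: float) -> float:
    """Per-token price decline needed to keep cost per task flat."""
    return 1.0 - 1.0 / token_growth
```

If tokens per task double and per-token prices fall by an assumed 30%, the multiplier is 2.0 × 0.7 = 1.4, so each task costs 40% more. Prices would have to fall by half just to hold cost per task flat at 2x token growth, and by three quarters at 4x.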

The Edge Question, Briefly

D'Ornano cites a Stanford and Together AI study showing the share of single-turn queries served on consumer-grade hardware has risen from 23% to 71% in two years. AMD just unveiled a development PC capable of running 200B-parameter models locally.

Single-turn Q&A over a corpus is becoming a feature, not a product. It will increasingly run on the user's device, free, with no SaaS layer required. If that's the bulk of your value proposition, you have a problem on a 24-to-36-month horizon.

The good news is that heavyweight, multi-step agentic workflows (the part that genuinely requires orchestration, state management, integrations, and domain logic) are not going to run on a Snapdragon. Vertical AI companies whose value lives in that layer are more durable, not less, because of where compute is migrating. The mandate is to make sure your product actually lives there.

Build for the workload that's actually arriving, not the one the headlines describe.

TL;DR