A 90-day plan to a working AI pilot

Why pilots fail

Across a hundred-plus AI engagements, the failure pattern is depressingly consistent. A team picks a use case that looks impressive in a demo (chatbots and code-gen, mostly). They run a sprint, build something cool, present it to leadership, and then discover three things: the use case wasn't actually painful, the data needed for production is in a system nobody owns, and there's no operating model for who runs the thing once IT walks away.

The good pilots avoid this not by being smarter, but by being more boring in the first 30 days and more ruthless in the next 60.

The 90-day rule: if you can't get a working pilot in front of real users within 90 days, the project is either too ambitious for a pilot or there's an organisational blocker the pilot won't fix. Stop. Rescope.

Days 1–30: Discover

This phase is unglamorous on purpose. No model, no code, no demo. The output is a one-page brief that answers four questions.

1. What's the value at stake?

Not "we want to use AI." Specifically: which process is it, how many FTEs touch it, what is the cycle time today, and what would a 30% improvement be worth in dollars or in customer experience? Pilots that can't answer this in numbers should not be funded.
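The arithmetic behind that answer fits in a few lines. The sketch below is one possible way to frame it, not a standard formula: the function name and every input figure are hypothetical placeholders, and it treats a 30% improvement as 30% of the time spent on the process being freed up.

```python
def annual_value_at_stake(ftes: float, share_of_time: float,
                          loaded_cost_per_fte: float, improvement: float) -> float:
    """Annual dollar value of an improvement on one process.

    ftes                -- people who touch the process
    share_of_time       -- fraction of their time spent on it (0..1)
    loaded_cost_per_fte -- fully loaded annual cost per person, in dollars
    improvement         -- fraction of that time the pilot is expected to free up
    """
    annual_process_cost = ftes * share_of_time * loaded_cost_per_fte
    return annual_process_cost * improvement


# Hypothetical inputs: 12 people spend half their time on the process.
value = annual_value_at_stake(
    ftes=12, share_of_time=0.5, loaded_cost_per_fte=100_000, improvement=0.30
)
print(f"${value:,.0f} per year")  # $180,000 per year
```

If the brief cannot supply the four inputs, that is usually the finding: the value at stake has not been measured yet.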

2. Is the data accessible?

Walk through the data the model needs end-to-end, from raw source to where the answer must land. We do a literal walkthrough — someone opens each system, fetches a real record, and we time it. If the data sits in three departments, two file shares, and one spreadsheet maintained by a single person on a laptop, you don't have a data problem, you have an organisational problem. Surface it now.

3. What does "good enough" look like?

Define accuracy, latency, and cost thresholds before you build anything. The accuracy bar for a credit decision is not the accuracy bar for a marketing email draft. Picking the threshold up-front prevents the dynamic where the team spends six weeks building something and three weeks arguing about whether it's good enough.
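One way to make the thresholds hard to argue about later is to write them down as data before the build starts. The sketch below assumes three illustrative measures (accuracy, p95 latency, cost per call); the class names and all numbers are hypothetical, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    min_accuracy: float       # fraction of eval cases answered correctly
    max_p95_latency_s: float  # seconds
    max_cost_per_call: float  # dollars


@dataclass(frozen=True)
class Measured:
    accuracy: float
    p95_latency_s: float
    cost_per_call: float


def failed_checks(t: Thresholds, m: Measured) -> list[str]:
    """Return the names of the thresholds the measured results miss."""
    failures = []
    if m.accuracy < t.min_accuracy:
        failures.append("accuracy")
    if m.p95_latency_s > t.max_p95_latency_s:
        failures.append("latency")
    if m.cost_per_call > t.max_cost_per_call:
        failures.append("cost")
    return failures


# Hypothetical bars: a marketing email draft tolerates more error than a credit decision.
EMAIL_DRAFT = Thresholds(min_accuracy=0.80, max_p95_latency_s=5.0, max_cost_per_call=0.05)
CREDIT_DECISION = Thresholds(min_accuracy=0.99, max_p95_latency_s=2.0, max_cost_per_call=0.50)

measured = Measured(accuracy=0.92, p95_latency_s=1.8, cost_per_call=0.03)
print(failed_checks(EMAIL_DRAFT, measured))      # []
print(failed_checks(CREDIT_DECISION, measured))  # ['accuracy']
```

The same measured results pass one use case and fail the other, which is the point of fixing the bar per use case up-front.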

4. Who owns it on day 91?

If the answer is "we'll figure that out," the pilot will end up belonging to nobody. Name a business owner, a tech owner, and a control owner before any code is written.

Day 30 gate: a one-page brief approved by an exec sponsor. If you can't get the sponsor to sign, the use case isn't ready. Better to discover that now than at day 75.

Days 31–60: Build

This is where most teams want to start. We start here only after the discovery brief is signed. The build phase has three streams that run in parallel.

Stream A — Solution architecture

The technology choices in 2026 are simpler than they were in 2024. The questions you have to answer are not about which model to pick; they are:

Where does the inference run? Cloud-native API, dedicated capacity, or on-prem? Driven by data sensitivity and latency.
What's the retrieval layer? A vector store, a structured database, an existing search index, or a hybrid? Driven by what the model needs to ground on.
How does it surface to the user? Embedded in an existing tool, a new lightweight interface, an API for another system, or a queue worker? Driven by where the user already lives.

Stream B — Build & integrate

Build in two-week increments, with an internal user-testing session on the Friday of every sprint. Two non-negotiable practices:

Eval suite from day one. A repeatable test that scores the model against a fixed dataset. Without this, every prompt change is a guessing game. With it, you can measure progress objectively and have a real conversation about trade-offs.
Human override built in. The user must be able to disagree with the model and have their disagreement captured as a labeled example. This is how the model gets better and how trust is built.
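A minimal sketch of both practices, under stated assumptions: the task is a hypothetical ticket-classification problem, `keyword_model` is a stand-in for the real model call, and exact-match accuracy is the only score. A real suite would hold far more cases and probably more than one metric.

```python
from typing import Callable

# Fixed evaluation set: (input text, expected label). Hypothetical examples.
EVAL_SET = [
    ("I was charged twice this month", "billing"),
    ("The app crashes when I log in", "technical"),
    ("How do I close my account?", "account"),
    ("Refund my last invoice please", "billing"),
]


def run_eval(model: Callable[[str], str], cases) -> float:
    """Score a model against the fixed cases; returns accuracy in 0..1."""
    correct = sum(1 for text, expected in cases if model(text) == expected)
    return correct / len(cases)


def record_override(cases, text: str, human_label: str) -> None:
    """Capture a user's disagreement with the model as a new labeled example."""
    cases.append((text, human_label))


def keyword_model(text: str) -> str:
    """Stand-in for the real model call (hypothetical)."""
    lowered = text.lower()
    if "charged" in lowered or "invoice" in lowered:
        return "billing"
    if "crash" in lowered:
        return "technical"
    return "other"


score = run_eval(keyword_model, EVAL_SET)
print(f"accuracy: {score:.2f}")  # accuracy: 0.75
```

Because the dataset is fixed, a prompt change that moves the score from 0.75 to 0.80 is a measurable improvement rather than an impression, and every override recorded by a user makes the next run harder to game.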

Stream C — Controls & governance

In parallel, answer the control questions: who reviews the model's outputs and how, what the escalation path is for an exception, what audit trail is needed, where PII appears in the prompt or the response, and how it is handled. None of this is glamorous. All of it is what determines whether the pilot survives a security review.
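As an illustration of the PII question, the sketch below redacts two data classes from a prompt before it leaves the building and reports which classes it found, so the audit trail can record them. The two regular expressions are deliberately simple assumptions; they will miss many real-world formats and are not a substitute for a proper PII detection service.

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> tuple[str, list[str]]:
    """Replace matches with a placeholder and list the data classes found."""
    found = []
    for label, pattern in (("EMAIL", EMAIL), ("PHONE", PHONE)):
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[{label}]", text)
    return text, found


prompt = "Customer jane.doe@example.com called from +1 415 555 0100 about her invoice."
clean, classes = redact(prompt)
print(clean)    # Customer [EMAIL] called from [PHONE] about her invoice.
print(classes)  # ['EMAIL', 'PHONE']
```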

Days 61–90: Deploy

The deploy phase has two parts.

Shadow run (days 61–75)

The model runs on real production traffic, but its outputs go to a queue, not to a customer. Humans process the same tickets and the model's answer is compared. You're measuring three things:

Agreement rate — how often does the model's answer match the human's?
Time delta — how much faster (or slower) is the model?
Failure modes — when the model gets it wrong, what's the pattern? Wrong category? Wrong tone? Hallucinated detail? Each pattern has a different fix.
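The three measurements can be computed from a simple log of paired results. This is a sketch under assumptions: the record fields are hypothetical, answers are compared by exact match, and the failure mode is a tag a reviewer adds by hand when the two answers differ.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShadowRecord:
    human_answer: str
    model_answer: str
    human_seconds: float
    model_seconds: float
    failure_mode: Optional[str] = None  # tagged by a reviewer on disagreement


def summarize(records):
    """Agreement rate, mean time saved per ticket, and failure-mode counts."""
    n = len(records)
    agree = sum(r.human_answer == r.model_answer for r in records)
    seconds_saved = sum(r.human_seconds - r.model_seconds for r in records) / n
    modes = Counter(
        r.failure_mode for r in records if r.human_answer != r.model_answer
    )
    return {
        "agreement_rate": agree / n,
        "mean_seconds_saved": seconds_saved,
        "failure_modes": dict(modes),
    }


# Hypothetical shadow-run log of four tickets.
records = [
    ShadowRecord("billing", "billing", 300, 4),
    ShadowRecord("technical", "technical", 420, 5),
    ShadowRecord("account", "billing", 240, 4, failure_mode="wrong category"),
    ShadowRecord("billing", "billing", 360, 3),
]
summary = summarize(records)
print(summary["agreement_rate"])      # 0.75
print(summary["mean_seconds_saved"])  # 326.0
print(summary["failure_modes"])       # {'wrong category': 1}
```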

Cut-over (days 76–90)

If shadow-run metrics meet the day-30 thresholds, you cut over progressively: 10% of traffic, then 25%, then 50%, then 100%, with a one-click rollback. If they don't, you don't cut over. You go back into the build phase with a clear list of what to fix, not into a "phase two" conversation.
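One common way to implement a progressive cut-over is deterministic bucketing: hash a stable identifier into one of 100 buckets and send a ticket to the model only if its bucket is below the current rollout percentage. The sketch below assumes a hypothetical string `ticket_id`; the function names are illustrative.

```python
import hashlib

ROLLOUT_STEPS = [10, 25, 50, 100]  # percent of traffic, as in the cut-over plan


def bucket(ticket_id: str) -> int:
    """Map an ID to a stable bucket in 0..99."""
    digest = hashlib.sha256(ticket_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route_to_model(ticket_id: str, rollout_percent: int) -> bool:
    """True if this ticket should be handled by the model at this rollout level."""
    return bucket(ticket_id) < rollout_percent


# Rough check of the traffic share at each step, using made-up IDs.
ids = [f"ticket-{i}" for i in range(10_000)]
for step in ROLLOUT_STEPS:
    share = sum(route_to_model(t, step) for t in ids) / len(ids)
    print(f"{step}% target -> {share:.1%} routed to the model")
```

Two properties matter here. A ticket that went to the model at 10% still goes to the model at 25%, so users do not flip between paths as the rollout widens. And the one-click rollback is a single configuration change: setting the percentage to 0 routes everything back to humans.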

Don't fall into the demo trap

The demo trap is when the project's success is measured by how impressive the demo is, not by how much it changes operations. The defense is to instrument the operating-model change from the start. Pick metrics that only move if the pilot is actually being used:

Tickets handled per FTE per day
Cycle time from intake to resolution
Cost per transaction
First-time-right rate

If those don't move at day 90, the pilot didn't work — regardless of how good the demo was.

Common pitfalls (and the fixes)

Pitfall	Fix
"We'll build the perfect prompt"	Build the eval suite first. Improvement should be a number, not a vibe.
No business owner	Refuse to start without one named. The exec sponsor isn't the owner.
PII discovered late	Data audit on day 5, not day 75. Surface red-flag data classes immediately.
Vendor lock-in by accident	Abstract the model behind an interface. Switching providers should cost a sprint, not a project.
"It works on my laptop"	Demos run on the same infrastructure as the eventual production system. No exceptions.
Adoption forgotten	Pair the pilot with a named change-management lead from day one. Treat adoption as a deliverable, not an afterthought.
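The lock-in fix in the table can be as small as one interface that application code depends on, with each provider wrapped behind it. In the sketch below, `TextModel`, `EchoModel`, and `UppercaseModel` are hypothetical names; the two stand-in classes take the place of real provider adapters, which would call a vendor SDK inside `complete`.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only surface the application is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in for provider A (hypothetical)."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UppercaseModel:
    """Stand-in for provider B (hypothetical)."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def draft_reply(model: TextModel, ticket_text: str) -> str:
    """Application code: knows about TextModel, not about any vendor."""
    return model.complete(f"Draft a reply to: {ticket_text}")


print(draft_reply(EchoModel(), "hi"))       # echo: Draft a reply to: hi
print(draft_reply(UppercaseModel(), "hi"))  # DRAFT A REPLY TO: HI
```

Switching providers then means writing one new adapter and re-running the eval suite against it, rather than touching every call site.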

What "good" looks like at day 90

One narrow, real use case running in production on at least 10% of traffic.
Eval suite running in CI. Every prompt or model change is scored automatically.
Named owners, named on-call, named escalation path.
Business metric movement attributable to the system, not anecdotes.
A short, honest list of what doesn't yet work and a plan to fix it.

That isn't glamorous. It's also the foundation everything else gets built on. The teams that get to a working pilot in 90 days usually have ten in production by month twelve. The teams that don't tend to still be running pilots a year later.