
Bespoke AI Solutions UK: A Document Processing Case Study

April 2026 · 9 min read

April 2026 follow-on — what we'd do differently today. The scenario below was scoped when most SME "AI" meant document intelligence or a Copilot rollout. By April 2026 two things have shifted the design. First, agentic AI: a build of this shape today would almost certainly be framed as an agent that owns the triage-classify-extract-post flow end to end, not a pipeline with human hand-offs at each step. Second, model choice is cheaper and more interchangeable: Claude Opus 4.6 is in the M365 Copilot Premium model selector and Claude Sonnet is rolling out inside Copilot Chat Frontier, so an equivalent build today is less locked in than a 2025 build would have been. The 95%-of-pilots-fail finding from MIT Project NANDA still underpins the design rule, though: start narrow, prove one workflow, then extend. If you'd like to talk about what a 2026-shaped version of this build would look like for your firm, book a discovery call.

Bespoke AI solutions UK get written about in abstract terms — strategy decks, maturity models, “AI transformation”. This post is the opposite. It walks through an end-to-end worked example of a four-week build for a UK accountancy firm — the brief, the build sequence, the kind of numbers a project like this should produce, and a clear rule for when this sort of work is the right answer and when it isn’t.

A note on framing: this is an illustrative scenario, not a past client engagement. The firm is composite and the figures are directional benchmarks for this type of build — grounded in how we scope and deliver bespoke AI work, not a specific delivered project. If you’d like to discuss a real build for your firm, book a discovery call.

Why the firm needed a bespoke AI tool, not Copilot

Picture a 40-person accountancy practice in the North of England. Their bottleneck sits in a single process: inbound client documents — invoices, bank statements, payroll summaries, receipts — arrive by email, portal upload, and the occasional post, in wildly mixed formats. A team of three spends most of its week triaging, classifying, extracting key fields, and keying the results into the firm’s accounting platform.

They’ve tried the obvious things first. Microsoft Copilot helped with emails around the process but couldn’t do the processing itself. A mainstream OCR tool handled clean invoices but fell over on anything scanned, rotated, or written in the client’s own template. A scripted automation someone built in Power Automate was brittle — every new document format broke it.

This is the “good enough” trap. Three different off-the-shelf tools, each solving 60% of the problem, none of them joined up. The staff workload kept growing as client volume rose. The firm knew they needed something different — a custom AI tool for a UK SME, designed for their exact document taxonomy, not a generic product they kept having to work around.

The brief — what success looked like

Before anyone writes a line of code, four measurable outcomes are agreed with the managing partner: staff time per document down by at least 50%, an extraction error rate below the manual baseline, a fully loaded cost per document in pence rather than pounds, and all processing UK-resident with a complete audit trail. Everything downstream — architecture, scope, sign-off — refers back to this brief.

Equally important are the deliberate exclusions. No chatbot interface. No expansion into other workflows. No “while we’re at it” features. Scope discipline is usually what separates a bespoke build that ships in four weeks from one that limps on for six months. If this pattern looks familiar, we cover the same discipline on our how-we-work page.

The build — four weeks from scoping to production

We run most SME builds on a four-to-six-week timebox. A scenario like this lands cleanly at four weeks when the firm has a single decision-maker and a clean document sample ready on day one. Here’s how the weeks break down.

Week 1 — Discovery and scoping

Two sessions with the operations lead, a sit-alongside day with the processing team, and a collection of 200 anonymised documents covering every format the firm sees in a typical month. The output of week one is a document taxonomy (around 12 categories), a field extraction specification per category, and a test set with known-correct answers. Nothing is built yet — and that’s the point. Jumping into code before you know what “correct” looks like is how these projects fail.
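
To make week one's deliverables concrete, here is a minimal sketch of what one taxonomy entry and one test case might look like, in Python. The category name, fields, and values are illustrative assumptions, not the firm's real taxonomy.

    # One entry in the field extraction specification: which fields the model
    # must pull out of this document type, and whether each is mandatory.
    # (Illustrative only; the real spec covers around 12 categories.)
    INVOICE_SPEC = {
        "category": "purchase_invoice",
        "fields": [
            {"name": "supplier_name",  "type": "string",  "required": True},
            {"name": "invoice_number", "type": "string",  "required": True},
            {"name": "invoice_date",   "type": "date",    "required": True},
            {"name": "net_amount",     "type": "decimal", "required": True},
            {"name": "vat_amount",     "type": "decimal", "required": False},
        ],
    }

    # One entry in the test set: a document paired with its known-correct
    # answers, so week-two accuracy is measured rather than eyeballed.
    TEST_CASE = {
        "document": "docs/anon_0042.pdf",  # hypothetical anonymised sample
        "expected_category": "purchase_invoice",
        "expected_fields": {"supplier_name": "Acme Ltd", "net_amount": "1250.00"},
    }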

Week 2 — Prototype on the Claude API

The engine is the Claude API for business, running in a UK Azure region. We build a classification prompt for the 12 document types, then per-type extraction prompts tied to the field specification. The prototype is run against the test set at the end of week two. On a well-specified build like this, classification accuracy in the mid-to-high 90s on first pass, and extraction accuracy in the low 90s before any tuning, are realistic benchmarks. Numbers in that range confirm the architecture is sound; the remaining two weeks are about turning a prototype into a production tool.
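
As a sketch of what the week-two classification step looks like against the Anthropic Python SDK: the model name, category list, and prompt wording below are illustrative assumptions, not the production prompts.

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Four of the ~12 categories, for illustration.
    CATEGORIES = ["purchase_invoice", "bank_statement", "payroll_summary", "receipt"]

    def classify(document_text: str) -> str:
        """Ask the model to pick exactly one category for a document."""
        response = client.messages.create(
            model="claude-sonnet-4-5",  # illustrative model name
            max_tokens=50,
            system="You classify accounting documents. Reply with exactly one "
                   "category name from the provided list, and nothing else.",
            messages=[{
                "role": "user",
                "content": f"Categories: {', '.join(CATEGORIES)}\n\nDocument:\n{document_text}",
            }],
        )
        return response.content[0].text.strip()

The per-type extraction prompts follow the same shape, each returning the fields named in that category's specification.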

Week 3 — Integration, testing, UK-resident deployment

Week three is the engineering week. We integrate the tool with the firm’s email-handling inbox and document storage, deploy to UK South in Azure with private networking, add the audit trail (every prompt, response, token count, and human override logged to a queryable store), and build a lightweight review interface for the cases the model flags as low confidence. Human-in-the-loop isn’t a weakness; it’s the design choice that makes accuracy claims credible.
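
A minimal sketch of the logging-and-routing idea, using a local SQLite store for illustration; the production build would log to a managed, queryable store in UK South, and the threshold and schema here are assumptions.

    import sqlite3, time

    db = sqlite3.connect("audit.db")
    db.execute("""CREATE TABLE IF NOT EXISTS audit_log (
        ts REAL, doc_id TEXT, prompt TEXT, response TEXT,
        tokens_in INTEGER, tokens_out INTEGER, confidence REAL,
        routed_to_human INTEGER, human_override TEXT)""")

    CONFIDENCE_THRESHOLD = 0.85  # illustrative; tuned against the week-one test set

    def log_and_route(doc_id, prompt, response, tokens_in, tokens_out, confidence):
        """Log every call in full; send low-confidence cases to the review queue."""
        needs_review = confidence < CONFIDENCE_THRESHOLD
        db.execute("INSERT INTO audit_log VALUES (?,?,?,?,?,?,?,?,?)",
                   (time.time(), doc_id, prompt, response,
                    tokens_in, tokens_out, confidence, int(needs_review), None))
        db.commit()
        return "review_queue" if needs_review else "auto_post"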

Week 4 — Staff training, rollout, handover

A half-day session for the processing team covers two things: how to use the review interface, and how to recognise when the tool is being asked to do something outside its training scope (new document types trigger escalation to the owner). The tool runs in parallel with manual processing for five days to validate the numbers, then the team switches to it as their primary workflow. Handover documentation, the evaluation harness, and a quarterly review schedule round out the week.
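
One way to score the five-day parallel run, assuming both pipelines emit a dictionary of extracted fields per document; the function and data shapes are illustrative.

    def parallel_run_agreement(tool_results: dict, manual_results: dict) -> float:
        """Field-level agreement between the tool and the manual process."""
        matches = total = 0
        for doc_id, manual_fields in manual_results.items():
            tool_fields = tool_results.get(doc_id, {})
            for field, manual_value in manual_fields.items():
                total += 1
                matches += int(tool_fields.get(field) == manual_value)
        return matches / total if total else 0.0

Each disagreement is adjudicated by a human before the switch-over.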

The results — what a build like this should deliver

At an eight-week post-go-live review, a well-scoped project like this should track against the original four-point brief roughly as follows. The numbers below are directional benchmarks for this type of build — the shape of the outcome, not a specific client result.

Throughput. Staff time per document falling by around 70%, against a 50% target, is realistic when the document taxonomy is tight. A three-person processing team reduces to one person handling oversight and exceptions.

Accuracy. Extraction error rate roughly 60–70% lower than the pre-automation manual baseline. The improvement is partly model quality and partly that humans catch fewer errors when they’re tired; the AI doesn’t get tired.

Cost per document. Fully loaded cost in the low single pence, dominated by token cost. Previous manual cost on this kind of workflow is typically £1–£2 in staff time. Payback on the build usually lands inside six to twelve weeks.
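
To make the payback arithmetic concrete, a rough worked example; every figure is an illustrative assumption drawn from the ranges quoted in this post, not a client result.

    build_cost = 10_000          # £, inside the £8k-£25k bracket quoted in the FAQ below
    manual_cost_per_doc = 1.50   # £, midpoint of the £1-£2 staff-time estimate
    ai_cost_per_doc = 0.03       # £, "low single pence" fully loaded
    docs_per_week = 700          # assumed volume for a firm of this size

    saving_per_week = docs_per_week * (manual_cost_per_doc - ai_cost_per_doc)
    payback_weeks = build_cost / saving_per_week
    print(f"~£{saving_per_week:,.0f} saved per week; payback in ~{payback_weeks:.0f} weeks")
    # ~£1,029 per week, so payback in roughly 10 weeks at this volume.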

Governance. All processing in UK South. The audit trail meets ICO accountability obligations and slots into the framework we cover in AI governance for UK SMEs.

The freed-up capacity shouldn’t become redundancies. The outcome to design for is that existing staff move into higher-value client-facing work and the firm onboards clients it previously couldn’t take on. Efficiency gains that reduce headcount are easy; efficiency gains that grow the business are the reason bespoke AI is worth building.

What made this a “bespoke AI” build, not a script

Calling something “AI” is fashionable. It’s worth being precise about what the label actually means in this kind of project. Three things distinguish this build pattern from a scripted automation or an off-the-shelf product.

Prompt engineering tied to the firm’s taxonomy. The prompts encode the firm’s document categories, extraction rules, and exception conditions. A generic tool has generic prompts. A bespoke tool has prompts written against your business.

Human-in-the-loop where it matters. The target isn’t 100% automation. It’s the model being accurate on the ~85% of documents that look like the test set and flagging the remaining 15% for human review. That split is the sweet spot for professional services work where errors have real consequences.

An evaluation harness, not vibes. Every change to the prompts or the architecture is re-run against the test set before it goes live. Quality isn’t a feeling — it’s a scoreboard. This is the single biggest thing that separates professional AI integration services from prompt-hacking dressed up as consulting.
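
A minimal sketch of the harness idea in Python, reusing the test-set shape from week one; the function names are assumptions, not a published API.

    def evaluate(classify_fn, extract_fn, test_set) -> dict:
        """Score a candidate prompt or architecture against known-correct answers."""
        cls_hits = field_hits = field_total = 0
        for case in test_set:
            predicted = classify_fn(case["document"])
            cls_hits += int(predicted == case["expected_category"])
            extracted = extract_fn(case["document"], predicted)
            for field, expected in case["expected_fields"].items():
                field_total += 1
                field_hits += int(extracted.get(field) == expected)
        return {
            "classification_accuracy": cls_hits / len(test_set),
            "extraction_accuracy": field_hits / field_total if field_total else 0.0,
        }

No change ships unless both scores clear the current baseline on this scoreboard.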

When to build bespoke, and when not to

Intellectual honesty matters here. Bespoke AI is the wrong answer for most problems most of the time. We apply three tests with clients before we recommend a build.

  1. Volume test. Is the workflow handling at least a few hundred transactions a month, with a realistic path to more? Lower than that and the amortisation doesn’t work — use an off-the-shelf tool or automate with Microsoft 365 tooling.
  2. Specificity test. Is the work genuinely specific to your firm — your taxonomy, your rules, your templates — or is it a generic pattern that a SaaS product already handles? If a product exists, pay for the product.
  3. Process test. Is the underlying process stable and well-understood? If the business process is broken, AI will automate the broken version and make the mess faster. Fix the process first, then automate.

The scenario above passes all three — high document volume, work specific to the firm’s client mix, a process refined over years. That’s the profile where bespoke pays back. If your workflow fails any of the three tests, our honest answer is usually “don’t build one.”

If you think your workflow might pass the three tests — or if you’re not sure — that’s the right first conversation to have. Our solutions page covers the shape of the builds we take on, and a 30-minute discovery call is usually enough to tell whether bespoke is the right answer or whether a simpler tool will do.

FAQ

How is a bespoke AI tool different from Microsoft Copilot?

Microsoft Copilot is a general-purpose assistant embedded in Microsoft 365. It’s excellent for drafting, summarising and Q&A across the tools you already use. A bespoke AI tool is built for one specific workflow in your business — with your document taxonomy, your validation rules, your audit trail, and your systems of record. Copilot answers questions. A bespoke tool performs a defined job end-to-end and returns a structured, auditable result. For most UK SMEs the right answer is Copilot for generic productivity plus one or two bespoke tools where the ROI is obvious. You don’t need bespoke for everything — only for the workflows that are high-volume, rules-heavy, and badly served by generic tools.

How long does a bespoke AI build take?

For a focused, single-workflow tool with a clear brief, four to six weeks from kick-off to production is realistic. That assumes a named owner on the client side, access to representative documents or data within the first week, and a willingness to scope tightly. Projects that overrun are almost always projects that expanded mid-build or lacked a decision-maker. Larger or cross-departmental builds take longer, but a good consultancy will still sequence them as a series of four-to-six-week deliveries rather than a single long project. Short timeboxes are how you preserve business accountability and avoid the six-month custom software trap.

What does a bespoke AI tool cost to build and run?

Build costs for a focused SME tool typically fall in the £8,000–£25,000 range depending on complexity, integration depth, and whether the firm needs ongoing support. Running costs are usually dominated by API tokens, which for most document or email workflows come to pence per transaction — often less than a penny. That’s orders of magnitude cheaper than the equivalent staff time. Payback is commonly six to twelve weeks on high-volume workflows. The right question isn’t “what does it cost” but “what does the current manual process cost, and how many months before the tool pays itself back.”

Is client data safe with a bespoke AI tool?

Yes, if the build is done correctly. The Claude API for business does not train on your data by default, and enterprise agreements specify zero retention beyond the processing window. For UK clients we deploy infrastructure in UK Azure regions, encrypt data in transit and at rest, and route all calls through a controlled backend so client documents never touch a public chatbot. Every prompt and response is logged to an audit trail that satisfies ICO expectations for accountability. Data residency, processing location, retention, and training opt-out are the four questions every UK SME should get in writing before any tool goes live — generic consumer AI accounts can’t answer them.

When is bespoke AI the wrong answer?

Three situations. First, if an off-the-shelf tool already does the job — paying to rebuild what Copilot or a SaaS product handles is waste. Second, if the workflow is low-volume or one-off — the amortisation doesn’t work on fewer than a few hundred transactions a month. Third, if the business process itself is broken — AI will automate the broken process faster, not fix it. Bespoke is right when the volume is high, the rules are specific to your firm, the gains are measurable, and the underlying process is stable. If any of those are missing, either pick a different tool or fix the process first.

Think a bespoke AI tool might fit your business?

A 30-minute discovery call is usually enough to tell whether a custom build is the right answer — or whether a simpler tool will do the job for you.

Book a Discovery Call