CASE STUDY

Custom AI Tools for UK SMEs: A Property Management Case Study

May 2026

Most posts about custom AI tools for UK SMEs stop at the architecture diagram. The interesting question for any operator is the one architecture diagrams never answer: did it actually move the number, and at what cost? This piece is a deliberately concrete answer to that question. It tells the story of a single workflow at a single anonymised UK property management firm, the bespoke AI tool we built around it, the numbers it produced ninety days in, and the things we would do differently if we ran the build again.

The client is a regional property management business with around 180 staff and roughly 12,000 tenancies under management across the Midlands and North West. The names and identifying details are removed, but the workflow, the architecture, the metrics, and the lessons are real. If you are weighing up whether a custom AI tool is justified inside your own business, this should give you a defensible shape to compare against.

The starting position

The bottleneck was inbound tenant email. Around 220 emails landed in the central tenant services inbox every weekday: maintenance requests, payment queries, lease questions, complaints, end-of-tenancy notifications, deposit disputes, and a long tail of edge cases. Three full-time triagers read every email, classified it, copied the relevant detail into Dynamics 365 as a case record, assigned a priority, and routed it to the right team. The average end-to-end triage time per email, measured across a fortnight of timestamped log data, was just under fourteen minutes. The team was permanently behind.

The client had already tried two routes. Microsoft Copilot in Outlook had been rolled out to the triage team for a quarter and rejected for two reasons: it could not write structured case records into Dynamics 365 in the format the downstream workflows expected, and its handling of emails in Polish, Urdu, and Romanian (roughly a fifth of tenant volume in the relevant regions) was inconsistent enough to require human re-checking. A separate proof-of-concept with a generic third-party email automation tool had failed in the demoware-to-production transition we wrote about in "Why AI automation stalls in most UK businesses": demo accuracy on a clean test set did not survive contact with real, messy inbox data.

The brief to us was small and explicit. One workflow, one success metric, one custom build, ninety-day payback target.

Why a custom AI tool, not Copilot

The Copilot question is the right place to start, because it is the question every UK SME owner now asks before commissioning a bespoke build. Copilot is excellent inside the Microsoft 365 surface: drafting replies, summarising threads, retrieving context from SharePoint. It is not designed to be the spine of a regulated business process that has to drop structured records into a line-of-business system, provide an auditable trail of decisions, and handle lower-resource languages such as Polish, Urdu, and Romanian at SLA-grade reliability.

None of that is a criticism of Copilot; it is a different product solving a different problem. We covered the underlying decision framework in "Copilot vs custom AI tools", and the property management case is a textbook instance of the criteria that push a workflow into the bespoke column.

The two-week discovery

We ran a paid two-week discovery before any code was written. Three workshops with the triage team, two with the head of tenant services, one with the IT lead, and a data review against four weeks of historical email logs. The deliverables were a written success metric, a workflow map, a data quality assessment, and a defined evaluation set of 300 historical emails hand-graded by the senior triager.

The success metric was sharp. Median triage time per email below four minutes across the next thirty operating days post-deployment, on no fewer than 95% of emails handled end-to-end by the tool. One sentence, signed by the head of tenant services. That sentence is the single most important deliverable of any discovery; without it, the project would have drifted into permanent-pilot.

Discovery also surfaced two issues the client had not previously named. Around 30% of inbound emails contained attachments — photos of maintenance issues, scanned tenancy documents, payment screenshots — that needed to be retained, summarised, and attached to the Dynamics case record. And multi-language handling required a deliberate prompt strategy, not just a language-detection step. Both changed the scope before any code was written.

What we built

Architecture

The build is straightforward in shape, which is usually the case for a custom AI tool that actually ships. Inbound emails arrive in the central tenant services mailbox via Exchange Online. A Microsoft Graph webhook fires on receipt and triggers an Azure Function, which extracts the body, headers, and attachments, and constructs a structured prompt for the Claude API. Claude returns a structured JSON response with classification, language, extracted issue summary, recommended priority, confidence score, and a draft holding reply where appropriate.
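To make the structured response concrete, here is a minimal sketch of how the integration code might parse and validate the model's JSON before anything downstream trusts it. It is an illustration, not the production code: the field names, the category labels (loosely derived from the email types listed earlier), and the priority labels are all assumptions.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels: the real taxonomy and priority scale are not published.
CATEGORIES = {"maintenance", "payment", "lease", "complaint",
              "end_of_tenancy", "deposit_dispute", "other"}
PRIORITIES = {"low", "normal", "high", "urgent"}
REQUIRED = {"classification", "language", "summary", "priority", "confidence"}


@dataclass(frozen=True)
class TriageResult:
    classification: str
    language: str
    summary: str
    priority: str
    confidence: float
    draft_reply: Optional[str] = None  # holding reply, only where appropriate


def parse_triage_response(raw: str) -> TriageResult:
    """Parse the model's JSON text and reject anything outside the schema."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["classification"] not in CATEGORIES:
        raise ValueError(f"unknown classification: {data['classification']!r}")
    if data["priority"] not in PRIORITIES:
        raise ValueError(f"unknown priority: {data['priority']!r}")
    confidence = data["confidence"]
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return TriageResult(
        classification=data["classification"],
        language=str(data["language"]),
        summary=str(data["summary"]),
        priority=data["priority"],
        confidence=float(confidence),
        draft_reply=data.get("draft_reply"),
    )
```

Under this sketch, a response that fails validation would be treated like a low-confidence case and sent to a human rather than written to the system of record.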

A routing layer then makes one of three decisions. High-confidence cases are written directly to Dynamics 365 as a case record and assigned to the correct team. Medium-confidence cases are written to Dynamics with a flag for human review. Low-confidence cases — or cases that trip a defined set of escalation rules — are routed to a human triager with the AI's assessment attached as a starting point. Every decision is logged with its prompt, response, and confidence to a separate evaluation store.
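The three-way decision can be sketched in a few lines. The threshold values and the escalation categories below are illustrative assumptions (the article does not publish the real ones), and the log record is simply a dictionary holding the three items named above.

```python
from enum import Enum


class Route(Enum):
    AUTO_COMMIT = "auto_commit"                          # written straight to the case system
    COMMIT_WITH_REVIEW_FLAG = "commit_with_review_flag"  # written, flagged for human review
    HUMAN_TRIAGE = "human_triage"                        # handed to a triager with the AI assessment


# Assumed values for illustration only.
HIGH_CONFIDENCE = 0.90
MEDIUM_CONFIDENCE = 0.70
ESCALATION_CATEGORIES = frozenset({"complaint", "deposit_dispute"})


def route(classification: str, confidence: float) -> Route:
    """Pick one of the three routes from the classification and confidence."""
    # Escalation rules win over confidence: these always go to a human.
    if classification in ESCALATION_CATEGORIES or confidence < MEDIUM_CONFIDENCE:
        return Route.HUMAN_TRIAGE
    if confidence >= HIGH_CONFIDENCE:
        return Route.AUTO_COMMIT
    return Route.COMMIT_WITH_REVIEW_FLAG


def decision_record(prompt: str, response: str, confidence: float, decision: Route) -> dict:
    """Build the entry that would be appended to the evaluation store."""
    return {
        "prompt": prompt,
        "response": response,
        "confidence": confidence,
        "route": decision.value,
    }
```

Keeping the escalation check ahead of the confidence check means a highly confident answer on a sensitive category still reaches a person.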

The architecture is deliberately simple. No vector database, no agent framework, no orchestration layer beyond what an Azure Function gives you for free. The complexity sits where it should — in the prompt, the evaluation harness, and the routing rules. This is the pattern we follow across our AI integration services: the smallest architecture that delivers the workflow, instrumented well enough to be operated for years.

Evaluation harness

The 300-email evaluation set was scored by the senior triager during discovery against the right answer for each: correct classification, correct language detection, correct priority, correct routing decision. That set was wired into a script that runs automatically against any prompt or model change before it reaches production. The harness was in place from week one of the build, which meant every prompt iteration was measured rather than guessed.

The first prompt scored 71% on classification accuracy across the harness. The third iteration scored 87%. The seventh, with multi-language examples and explicit edge-case handling, scored 94%. Without the harness, we would have shipped the 71% prompt and called it good. The harness is the dividing line between a custom AI tool that ships and one that stalls.
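A harness of this kind does not need to be elaborate. The sketch below scores any prediction function against a hand-graded set on the four graded dimensions and returns per-field accuracy. The record layout and the `gate` rule (block a change if any field regresses against the current baseline) are assumptions made for illustration, not a description of the script used on the project.

```python
from typing import Callable, Dict, List

# The four dimensions the senior triager graded.
FIELDS = ("classification", "language", "priority", "routing")


def score(graded: List[Dict[str, str]],
          predict: Callable[[str], Dict[str, str]]) -> Dict[str, float]:
    """Return per-field accuracy of predict() against the hand-graded answers."""
    if not graded:
        raise ValueError("evaluation set is empty")
    correct = {field: 0 for field in FIELDS}
    for example in graded:
        predicted = predict(example["email"])
        for field in FIELDS:
            if predicted.get(field) == example[field]:
                correct[field] += 1
    return {field: correct[field] / len(graded) for field in FIELDS}


def gate(scores: Dict[str, float], baseline: Dict[str, float]) -> bool:
    """Allow a prompt or model change only if no field falls below the baseline."""
    return all(scores[field] >= baseline[field] for field in FIELDS)
```

In use, `predict` would wrap the candidate prompt and model, and the scores from the prompt currently in production would serve as the baseline.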

The seven-week build

Build duration was seven weeks against an original five-week estimate, with the slip almost entirely down to the multi-language work that discovery had underweighted. Week one was infrastructure: Azure Function, Graph webhook, Claude API integration, evaluation harness. Weeks two and three were prompt iteration, scored against the harness on every change. Week four was the routing layer and Dynamics integration. Week five was attachment handling. Weeks six and seven were multi-language reliability work and a phased rollout with a manual review safety net.

Deployment was phased over five operating days: 20% of inbound through the tool with every decision human-reviewed, rising to 50% with high-confidence decisions auto-committed, then 100% with the full routing rules running to spec. The rollout caught two prompt issues the harness had not — both edge cases involving forwarded chains absent from the historical evaluation set.
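One common way to send a fixed share of traffic through a new path is to hash a stable identifier into a bucket. The sketch below assumes that approach; the article does not say how the 20%, 50%, and 100% splits were actually implemented, and the function names are hypothetical.

```python
import hashlib

# Percent of inbound email sent through the tool at each rollout stage.
ROLLOUT_STAGES = (20, 50, 100)


def rollout_bucket(message_id: str) -> int:
    """Map a message ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(message_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def handled_by_tool(message_id: str, rollout_percent: int) -> bool:
    """True if this message falls inside the current rollout percentage."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    return rollout_bucket(message_id) < rollout_percent
```

Because the bucket depends only on the message ID, a message that went through the tool at 20% would still go through it at 50% and 100%, which keeps the stages comparable.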

The numbers, ninety days in

Median triage time per email: 14 minutes → 3 minutes (78% reduction)

End-to-end emails handled by the tool with no human re-touch: 71%

Classification accuracy on production traffic: 93% (audited monthly)

Staff hours reclaimed across the triage team: ~9 hours per operating day

Annualised staff cost reclaimed: ~£71,000

Build and discovery cost: £32,400

Running cost (Claude API + Azure): ~£180 per month

Payback period: approximately 10 weeks measured against reclaimed staff time

The success metric was hit inside the first thirty operating days post-deployment. Median triage on emails handled end-to-end by the tool settled at three minutes against the four-minute target. Two of the three triagers were redeployed to higher-value tenant relationship work; the third now functions as the named operator for the tool, watching the success metric monthly and chairing the operate-phase review.

Two numbers matter more than the headlines. The first is the 71% of emails handled end-to-end by the tool with no human re-touch — that is the figure the staff-hours reclaimed depends on, and it has held steady across ninety days. The second is the running cost: a hundred and eighty pounds a month against a seventy-one thousand pound annualised saving. The headline cost of a custom AI tool is the engineering, not the inference. Once the build is in place, the operating economics are favourable in a way that off-the-shelf per-seat licensing rarely matches at this volume.
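The engineering-versus-inference point can be checked directly from the figures quoted above. The snippet uses only numbers stated in this piece, all of which are approximate.

```python
MONTHLY_RUNNING_COST_GBP = 180      # Claude API + Azure
ANNUALISED_SAVING_GBP = 71_000      # reclaimed staff cost
BUILD_AND_DISCOVERY_GBP = 32_400

annual_running_cost = MONTHLY_RUNNING_COST_GBP * 12
share_of_saving = annual_running_cost / ANNUALISED_SAVING_GBP
years_of_running_cost = BUILD_AND_DISCOVERY_GBP / annual_running_cost

print(f"£{annual_running_cost:,} per year, {share_of_saving:.1%} of the annualised saving")
print(f"build and discovery equals {years_of_running_cost:.0f} years of running cost")
# £2,160 per year, 3.0% of the annualised saving
# build and discovery equals 15 years of running cost
```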

What we would do differently

Three things, in honesty. We underestimated multi-language complexity in discovery — the workshops should have included a deliberate sampling of non-English emails with the senior triager translating and grading live. We caught the issue in build, but it cost two weeks. We built the evaluation harness in week one but did not require it to score the human triagers themselves on the same 300 emails until week four; doing that earlier would have given a sharper benchmark from the start. And the routing rules were initially documented inside the Azure Function code rather than as a separately reviewable artefact — we rewrote them as a YAML policy file in operate phase, which is where they should have started.
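To show what routing rules look like as a separately reviewable artefact, here is a minimal sketch of a policy document and the code that reads it. The policy used on the project is YAML; JSON stands in here only so the example runs on the Python standard library. The keys, thresholds, and category names are all hypothetical.

```python
import json

# Hypothetical policy document, kept outside the function code so it can be
# reviewed and changed without touching the integration logic.
POLICY_TEXT = """
{
  "thresholds": {"auto_commit": 0.90, "review_flag": 0.70},
  "always_escalate": ["complaint", "deposit_dispute"]
}
"""


def load_policy(text: str) -> dict:
    """Parse the policy and check the thresholds are ordered sensibly."""
    policy = json.loads(text)
    thresholds = policy["thresholds"]
    if not 0.0 <= thresholds["review_flag"] <= thresholds["auto_commit"] <= 1.0:
        raise ValueError("need 0 <= review_flag <= auto_commit <= 1")
    return policy


def decide(policy: dict, classification: str, confidence: float) -> str:
    """Apply the policy to one classified email and name the route."""
    if classification in policy["always_escalate"]:
        return "human_triage"
    thresholds = policy["thresholds"]
    if confidence >= thresholds["auto_commit"]:
        return "auto_commit"
    if confidence >= thresholds["review_flag"]:
        return "commit_with_review_flag"
    return "human_triage"
```

With the rules held as data, a change to a threshold or an escalation category becomes a small diff that the named operator can review, and it can be run through the evaluation set like any prompt change.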

When a custom AI tool beats off-the-shelf

The decision criteria are not subtle. A custom AI tool earns its keep when the workflow requires routing into a proprietary or line-of-business system, structured outputs another system has to consume, multi-language handling beyond what off-the-shelf assistants reliably manage, an auditable trail of decisions for regulatory or contractual reasons, or volume that makes per-seat licensing economically silly. The property management workflow ticks four of those five. Most workflows in regulated UK SMEs — financial services, legal, healthcare, property, professional services — tick at least three.

Off-the-shelf tools remain the right answer for a long tail of workflows that do not need any of the above. Drafting, summarising, brainstorming, contextual retrieval inside Microsoft 365 — Copilot is genuinely the right product for those jobs. The question is not bespoke versus Copilot in the abstract; it is which one matches the shape of the workflow you are trying to change. If yours looks like the property management one — high inbound volume, classification and routing, structured handoff into a system of record, defensible accuracy — a custom AI tool built on the Claude API is usually the right answer, and a paid two-week discovery is the cheapest way to find out for certain. Our how we work page sets out the discovery, build, and operate phases in detail.

FAQ

When the workflow requires routing into a proprietary or line-of-business system, structured outputs that another system has to consume, multi-language handling beyond what off-the-shelf assistants reliably manage, an auditable trail of decisions, or an evaluation harness that proves quality at SLA-grade levels. Copilot and ChatGPT are excellent for tasks inside Microsoft 365 or in the browser. They are not designed to be the spine of a regulated, integrated business process. The moment a workflow needs an audit trail, a defined accuracy metric, or to drop a structured record into Dynamics 365, Sage, or a bespoke CRM, the case for a custom AI tool is usually decisive.

For a single workflow of the shape described in this case study (inbound classification, routing, and structured handoff to an existing system), expect a paid two-week discovery in the low thousands, a five-to-eight-week build between fifteen and forty thousand pounds, and an operate-phase retainer in the low thousands per month. Running costs for the Claude API on a workload of around 220 emails per day are in the order of a hundred pounds a month. The headline cost is engineering, not inference. Payback in disciplined deployments is typically eight to twelve weeks measured against reclaimed staff time.

For a focused single-workflow build, seven to ten weeks end to end is a realistic range. Two weeks of paid discovery to dissect the workflow, agree the success metric, and examine the data. Five to eight weeks of build, with the evaluation harness in place from week one so that quality is measurable from the first prompt. A one-week phased rollout with a manual review safety net before fully cutting over. A four-to-eight-week build is the headline range we publish on the solutions page, and the property management case study in this post landed at the upper end at seven weeks because of multi-language complexity that was not surfaced in discovery.

The Claude API is the reasoning component inside the workflow. In this build, it receives the inbound email, an extracted set of metadata from Microsoft Graph, and a structured prompt that defines the classification taxonomy and the required JSON output. It returns a structured response with a classification, an extracted issue summary, a recommended priority, and a confidence score. The surrounding code handles authentication, retry, evaluation logging, and routing the structured response into Dynamics 365. Claude does not touch the system of record directly: it produces a structured output that the integration layer trusts only when confidence is high enough, and routes to a human triager otherwise.

A named operational owner on the client side runs the tool day to day, watches a single success metric, and chairs a thirty-minute monthly review. The consultancy provides an operate-phase retainer that covers evaluation harness runs on model or prompt changes, quarterly accuracy audits, and a defined response time for re-prompts when the underlying data drifts. IT supports the integration, identity, and governance layer. Without that ownership split (a named operator on the business side, retained operate-phase support from the build partner), custom AI tools decay quietly inside twelve months, the failure pattern detailed in our piece on why AI automation stalls.

Got a workflow that looks like this one?

If you have a workflow with high inbound volume, structured routing, and accuracy that has to be defensible — book a 30-minute discovery call. We will tell you straight whether a custom AI tool is the right answer for it, or whether Copilot is doing the job already. No sales theatre.
