We Connected an LLM to a 12-Year-Old Codebase. Here's What Broke.

Every "add AI to your product" tutorial assumes you are starting fresh. Greenfield repo, clean data, no users yet. Real integration work looks nothing like that.

Last year our team picked up a fintech client with a loan-application platform that had been running since 2014. Node.js backend, a Postgres database that three different teams had touched, and a checkout flow that processed real money every few seconds. The ask sounded simple: use an LLM to pre-screen loan applications and flag the risky ones for a human.

It was not simple. Here is what broke, in the order it broke, and the pattern that finally held.

Break #1: The Synchronous Call That Took Down Checkout

The first version was the obvious one. A developer added the LLM call directly into the application-submission handler. Application comes in, call the model, get a risk score, continue.

// The version that looked fine in the demo
async function submitApplication(application) {
  const validated = validateApplication(application);
  const riskScore = await llmClient.scoreRisk(validated); // <-- new line
  await db.saveApplication({ ...validated, riskScore });
  return { status: "submitted" };
}

It worked in the demo. It worked in staging. Then the model provider had a slow afternoon, response times went from 800ms to 19 seconds, and every loan submission hung. The LLM call was now a hard dependency in the middle of a money flow. No timeout, no fallback. A third-party hiccup became our outage.

The lesson is not "LLMs are unreliable." The lesson is that we treated a probabilistic, network-bound, third-party service like a local function call. Your existing code was built around deterministic, fast, in-process logic. An LLM is none of those things.
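For illustration, here is a minimal sketch of what bounding that call could look like. The helper names `withTimeout` and `scoreWithDeadline` are hypothetical, not taken from the client's codebase, and the sketch assumes the scoring function returns a promise. Note that `Promise.race` only stops waiting; it does not cancel the underlying HTTP request.

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical wrapper: wait at most `ms` for the model, otherwise use the fallback.
async function scoreWithDeadline(scoreFn, application, ms, fallbackFn) {
  try {
    return await withTimeout(scoreFn(application), ms);
  } catch (err) {
    // Timeout or provider error: the money flow continues with a fallback score.
    return fallbackFn(application);
  }
}
```

With something like this in place, a slow provider costs the handler at most `ms` milliseconds instead of the full 19 seconds.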

Break #2: The Data Layer Nobody Audited

Once we fixed the timeout, the model started returning confident, well-formatted, completely wrong risk scores.

The cause was not the model. It was the data. The applications table had three columns that all sort of meant "annual income," populated by different intake forms over a decade. Some were monthly figures. Some were strings with currency symbols. The model dutifully reasoned over whatever it got and produced garbage with total confidence.

We spent more time cleaning and reconciling that data than we spent on the actual model integration. That ratio surprised the client. It should not surprise anyone who has done this before. If your data has a decade of drift, the integration project is a data project wearing an AI hat.
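A reconciliation step for that kind of drift might look roughly like the sketch below. The column names (`annual_income`, `income_monthly`, `income_text`) and the precedence between them are assumptions made for illustration, not the client's real schema, and the parser assumes US-style number formatting.

```javascript
// Parse values such as "$85,000", "85000.50", or 85000 into a non-negative number.
// Returns null for anything unusable rather than guessing.
function parseAmount(raw) {
  if (raw === null || raw === undefined) return null;
  const cleaned = String(raw).replace(/[\s,$€£]/g, "");
  if (cleaned === "") return null; // Number("") would otherwise be 0
  const value = Number(cleaned);
  return Number.isFinite(value) && value >= 0 ? value : null;
}

// Collapse the three hypothetical income columns into one annual figure,
// recording which column it came from so the choice can be audited later.
function normalizeAnnualIncome(row) {
  const annual = parseAmount(row.annual_income);
  if (annual !== null) return { annualIncome: annual, source: "annual_income" };

  const monthly = parseAmount(row.income_monthly);
  if (monthly !== null) return { annualIncome: monthly * 12, source: "income_monthly" };

  const text = parseAmount(row.income_text);
  if (text !== null) return { annualIncome: text, source: "income_text" };

  return { annualIncome: null, source: null };
}
```

Returning `null` instead of a best guess lets the caller route the application to manual review rather than hand the model a number nobody can vouch for.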

Break #3: The Cost Telemetry We Added Too Late

The pilot looked cheap. A few thousand applications a day, a few cents each. Then someone enabled the feature for a second product line without telling us, volume tripled overnight, and the model bill for that month arrived looking like a typo.

Nobody was watching per-call cost. We had logging for latency and errors because those page someone at 3am. Cost just accumulates quietly until finance asks a pointed question. We added per-call cost tracking after the fact, which is the most expensive time to add it.
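A minimal sketch of per-call metering, assuming the provider bills per token and reports token counts with each response. The prices, the `CostMeter` class, and the `usage` shape are all hypothetical; substitute your provider's actual rates and fields.

```javascript
// Hypothetical prices in USD per 1,000 tokens.
const PRICE_PER_1K = { input: 0.003, output: 0.015 };

// Cost of a single call from its token counts.
function callCostUsd(usage, prices = PRICE_PER_1K) {
  return (usage.inputTokens / 1000) * prices.input +
         (usage.outputTokens / 1000) * prices.output;
}

// Accumulates spend per UTC day and per feature, and alerts once per day
// when the running total first exceeds the budget.
class CostMeter {
  constructor(dailyBudgetUsd, onAlert) {
    this.dailyBudgetUsd = dailyBudgetUsd;
    this.onAlert = onAlert;
    this.days = new Map(); // "YYYY-MM-DD" -> { totalUsd, byFeature, alerted }
  }

  record(usage, feature, now = new Date()) {
    const day = now.toISOString().slice(0, 10);
    const entry = this.days.get(day) ?? { totalUsd: 0, byFeature: {}, alerted: false };
    const cost = callCostUsd(usage);
    entry.totalUsd += cost;
    entry.byFeature[feature] = (entry.byFeature[feature] ?? 0) + cost;
    if (!entry.alerted && entry.totalUsd > this.dailyBudgetUsd) {
      entry.alerted = true;
      this.onAlert({ day, totalUsd: entry.totalUsd, byFeature: { ...entry.byFeature } });
    }
    this.days.set(day, entry);
    return cost;
  }
}
```

Tagging each call with a feature name is the part that would have surfaced the second product line: the alert payload shows which feature the spend came from.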

The Pattern That Finally Held

We stopped putting the LLM inside the application code. We put a gateway in front of it.

// The version that survived production
async function submitApplication(application) {
  const validated = validateApplication(application);

  // AI scoring is now time-boxed, isolated, and backed by a fallback
  const riskScore = await aiGateway.scoreRisk(validated, {
    timeoutMs: 1200,
    fallback: () => rulesBasedScore(validated), // deterministic backup
  });

  await db.saveApplication({ ...validated, riskScore });
  return { status: "submitted" };
}

The gateway is a thin service that sits between our application and the model. It owns four things the application should never have owned:

Timeouts and circuit breaking. If the model is slow, the gateway gives up fast and the request falls back to the old rules-based score. Checkout never hangs again.
A deterministic fallback. A wrong-but-instant score beats a perfect score that arrives after the user gave up.
Cost and usage telemetry. Every call is metered. A spike triggers an alert, not a surprise invoice.
An audit trail. Every score is logged with the input, the model version, and the final human decision. For a regulated lender, that log is not optional.

The application code does not know or care that an LLM is involved. It calls aiGateway.scoreRisk() the same way it calls anything else. The model can be swapped, upgraded, or disabled entirely behind that interface without touching the money flow.
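To make those four responsibilities concrete, here is a compact in-process sketch of a gateway exposing the same `scoreRisk(input, { timeoutMs, fallback })` shape used above. It is an illustration under stated assumptions, not the production service: `createAiGateway`, `scoreWithModel`, `recordUsage`, `writeAudit`, the result shape `{ score, modelVersion, usage }`, and the breaker thresholds are all hypothetical. The human decision mentioned above would be attached to the audit record later, once a reviewer has acted.

```javascript
// Hypothetical factory; the model client, usage meter, and audit sink are injected.
function createAiGateway({ scoreWithModel, recordUsage, writeAudit,
                           failureThreshold = 5, cooldownMs = 30000 }) {
  let consecutiveFailures = 0;
  let openUntil = 0; // while Date.now() < openUntil, the model is skipped entirely

  // Wait for the model at most `timeoutMs`, then give up.
  async function callModel(input, timeoutMs) {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error("model timeout")), timeoutMs);
    });
    try {
      return await Promise.race([scoreWithModel(input), timeout]);
    } finally {
      clearTimeout(timer);
    }
  }

  return {
    async scoreRisk(input, { timeoutMs, fallback }) {
      const startedAt = Date.now();
      const circuitOpen = startedAt < openUntil;
      let result = null;

      if (!circuitOpen) {
        try {
          result = await callModel(input, timeoutMs);
          consecutiveFailures = 0;
        } catch (err) {
          // Simple breaker: after N consecutive failures, stop calling for a while.
          consecutiveFailures += 1;
          if (consecutiveFailures >= failureThreshold) {
            openUntil = Date.now() + cooldownMs;
            consecutiveFailures = 0;
          }
        }
      }

      if (result) recordUsage(result.usage); // cost and usage telemetry
      const score = result ? result.score : await fallback(); // deterministic backup
      const source = result ? "model"
        : circuitOpen ? "fallback:circuit-open" : "fallback:model-error";

      // Audit trail: what went in, what came out, and which path produced it.
      writeAudit({
        input, score, source,
        modelVersion: result ? result.modelVersion : null,
        latencyMs: Date.now() - startedAt,
      });
      return score;
    },
  };
}
```

The breaker here counts consecutive failures only; a production version would more likely live in a separate service and use a rolling error rate, but the contract seen by `submitApplication` stays the same.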

Making that single architectural decision on roughly day 47 instead of day 1 is the one thing I would change if I could. We have not had an AI-related outage in the months since.

Why This Keeps Happening

This is not a niche mistake. Gartner forecasts that over 40% of agentic AI projects will be canceled by the end of 2027, and the usual causes are not bad models. They are escalating costs, unclear value, and weak risk controls. All three are integration problems.

Meanwhile the pressure to ship is real: Gartner also expects 40% of enterprise applications to feature task-specific AI agents by the end of 2026. So teams bolt a model into a handler, demo it, and ship. The demo never shows you the third-party slow afternoon.

What We'd Do Differently

If we restarted this project knowing what we know now:

Audit the data before writing any model code. A one-week data inventory would have caught the three-income-columns problem before it produced a single wrong score.
Put the gateway in on day one. It is four extra days of work up front, and it pays for itself the first time the provider has a slow afternoon.
Add cost telemetry with the first call, not the first invoice. Meter it before you need it.
Pick a narrow, measurable pilot. "Flag the risky 5% for human review" is testable. "Use AI in underwriting" is not.

We wrote up the full version of this as a six-step framework, with the integration patterns, the data-readiness checklist, and the build-versus-buy math: how to integrate AI into your existing systems without breaking production. The section on choosing an integration pattern is the part I wish we had read first.

Wrapping Up

AI integration rarely fails at the model. It fails at the seam where the model meets software that was designed before the model existed. Keep the AI on its own side of an API contract. Give it timeouts, a fallback, telemetry, and an audit trail. Treat the data layer as the real project.

We are the team at Empiric Infotech, and we build AI integrations into mobile apps, fintech platforms, and clinical tools. If you have a war story of your own, drop it in the comments. I would genuinely like to read it.
