The Modern Production AI Stack: RAG, Fine-Tuning, Evaluation and LLMOps

A demo that answers three questions in a meeting room is not a system. Turning a language model into something a business can depend on means assembling a production AI stack: a set of layers that move data in, ground the model in your reality, orchestrate its actions, and then prove — continuously — that it still behaves. This article is a technical, vendor-neutral tour of that stack as it looks in 2026, with concrete guidance on RAG, fine-tuning, evaluation and LLMOps. It is written for engineering and product leaders who want to know how to build a production-ready AI stack, not just which model is trending this quarter.

The model itself is the smallest part of the work. Most of the reliability, cost control and compliance lives in the layers around it. So rather than naming brands, we will name categories — because the right vector store or evaluation tool changes, but the shape of the architecture does not.

The layers of a production AI stack

Think of the stack as seven layers, each with a clear job. The discipline is to treat them as separable concerns rather than one monolithic prompt.

1. Data ingestion and pipelines

Everything downstream depends on clean, current, well-structured data. This layer pulls from documents, databases, APIs and event streams; parses and chunks content; normalises metadata; and keeps it fresh through incremental syncs rather than one-off dumps. Chunking strategy, deduplication, and capturing source and timestamp metadata are not glamorous, but they decide whether retrieval later returns the right passage or a stale one. That is why data engineering is the foundation of the whole stack, and the focus of our data engineering work.
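
As a minimal sketch of the chunking, deduplication and metadata capture described here: the `Chunk` fields, the word-window sizes and the function names below are illustrative assumptions, not a prescribed schema. Real pipelines usually chunk on document structure (headings, paragraphs) rather than fixed word counts.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str       # where the text came from (hypothetical field name)
    updated_at: str   # timestamp of the source document, ISO 8601
    digest: str       # content hash used for deduplication


def chunk_document(text: str, source: str, updated_at: str,
                   size: int = 200, overlap: int = 40) -> list[Chunk]:
    """Split text into overlapping word windows, keeping provenance metadata."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks: list[Chunk] = []
    for start in range(0, len(words), step):
        window = " ".join(words[start:start + size])
        digest = hashlib.sha256(window.lower().encode("utf-8")).hexdigest()
        chunks.append(Chunk(window, source, updated_at, digest))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def deduplicate(chunks: list[Chunk]) -> list[Chunk]:
    """Keep the first occurrence of each distinct chunk text."""
    seen: set[str] = set()
    unique: list[Chunk] = []
    for chunk in chunks:
        if chunk.digest not in seen:
            seen.add(chunk.digest)
            unique.append(chunk)
    return unique
```

Carrying `source` and `updated_at` on every chunk is what later lets retrieval cite a passage and filter out stale ones.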

2. Vector storage, retrieval and RAG

Retrieval-augmented generation (RAG) is how you give a general model your specific, private knowledge without retraining it. Text is embedded into vectors, stored in a vector index, and at query time the most relevant chunks are retrieved and placed into the prompt as grounded context. In practice, naive vector search is rarely enough: production RAG combines dense (vector) and sparse (keyword) retrieval in a hybrid setup, adds a re-ranking step to push the best passages to the top, and frequently uses query rewriting so the user's phrasing maps onto how documents are actually written. Done well, RAG reduces hallucination, lets you cite sources, and reflects changes in your data as soon as the index is re-synced, with no model retraining required.
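
One common way to merge dense and sparse results in a hybrid setup is reciprocal rank fusion (RRF). The sketch below assumes each retriever has already returned an ordered list of chunk IDs; the function name and the constant `k = 60` (a widely used default) are choices made for illustration, and other fusion or re-ranking methods are equally valid.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk scores 1 / (k + rank) per list it appears in, so chunks
    that rank well in several retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is stable.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


dense_hits = ["a", "b", "c"]    # hypothetical vector-search results
sparse_hits = ["b", "d", "a"]   # hypothetical keyword-search results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF only needs ranks, not comparable scores, which is why it is a convenient first step before a heavier re-ranker is applied to the fused list.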

3. The model layer: prompting vs RAG vs fine-tuning

Here is where teams most often over-engineer. There are three distinct levers, and they are not interchangeable. Prompting (including few-shot examples and structured system prompts) shapes behaviour with zero training cost and is the right first move for most tasks. RAG injects knowledge the model never saw. Fine-tuning changes the model's weights to teach a consistent style, format or narrow skill. The decisive insight is that these solve different problems, which is why the RAG vs fine-tuning question deserves its own section below. Above all of this sits model selection and routing — sending easy requests to a small, cheap model and hard ones to a larger model is one of the highest-leverage cost moves available.
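
Routing can start as a few explicit rules before any learned classifier is involved. The sketch below is one hypothetical heuristic: the model names, the word-count threshold and the `needs_tools` flag are placeholders, and the right rules should come from your own evaluation data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str    # placeholder model tier, not a real product name
    reason: str   # recorded so routing decisions can be audited later


def route_request(prompt: str, needs_tools: bool = False,
                  max_small_words: int = 150) -> Route:
    """Send short, simple requests to a small model and the rest to a large one."""
    if needs_tools:
        return Route("large-model", "tool use requested")
    if len(prompt.split()) > max_small_words:
        return Route("large-model", "long input")
    return Route("small-model", "short, simple request")
```

Logging the `reason` alongside each response makes it possible to check later whether the cheap route is costing quality.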

4. Orchestration and agents

Real workflows rarely fit in a single call. The orchestration layer chains steps, calls tools and APIs, manages memory, and decides what happens next. At the simpler end this is a deterministic pipeline; at the more autonomous end it is an agent that plans and acts across systems. Agentic patterns are powerful but raise the engineering bar — they need permissions, error handling and human-in-the-loop checks on anything sensitive. We go deeper on this in AI agents in production, and on where genuine autonomy is warranted in what are AI agents.
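
The permission, error-handling and human-in-the-loop checks mentioned above can be enforced at the point where a tool is executed. This is a minimal sketch under assumed names (`Tool`, `ApprovalRequired`, `execute_tool`); it is not the API of any particular agent framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[dict], str]
    sensitive: bool = False   # sensitive tools need explicit human approval


class ApprovalRequired(Exception):
    """Raised when a sensitive tool is called without human sign-off."""


def execute_tool(tool: Tool, args: dict, allowed: set[str],
                 approved: bool = False) -> str:
    # 1. Permissions: the agent may only call tools it was granted.
    if tool.name not in allowed:
        raise PermissionError(f"tool '{tool.name}' is not permitted")
    # 2. Human-in-the-loop: pause before anything sensitive.
    if tool.sensitive and not approved:
        raise ApprovalRequired(f"tool '{tool.name}' needs human approval")
    # 3. Error handling: report tool failures as text the orchestrator can act on.
    try:
        return tool.run(args)
    except Exception as exc:
        return f"tool_error: {exc}"
```

Keeping these checks in the executor rather than in the prompt means they still hold when the model misbehaves.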

5. The evaluation harness

You cannot improve what you cannot measure, and LLM outputs are non-deterministic. A serious stack treats evaluation as first-class: a curated test set of representative inputs and expected behaviours, automated scoring (exact-match where possible, LLM-as-judge with care, retrieval metrics for RAG), and regression tests that run on every prompt or model change. This is the difference between shipping on vibes and shipping on evidence — the heart of LLM evaluation and observability.
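
A harness of this kind can begin very small. The sketch below shows exact-match scoring, a recall-at-k retrieval metric and a regression gate; the `EvalCase` structure, the normalisation rule and the tolerance value are assumptions for illustration, and `system` stands in for whatever callable wraps your application.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected: str


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def exact_match_rate(cases: list[EvalCase], system: Callable[[str], str]) -> float:
    """Fraction of cases where the system's answer matches the expected one."""
    if not cases:
        raise ValueError("the test set is empty")
    hits = sum(normalise(system(c.question)) == normalise(c.expected) for c in cases)
    return hits / len(cases)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant chunk IDs that appear in the top k retrieved."""
    if not relevant:
        raise ValueError("no relevant chunks labelled for this case")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def passes_regression(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Gate a prompt or model change: fail if quality drops beyond the tolerance."""
    return score >= baseline - tolerance
```

Running `passes_regression` in continuous integration on every prompt or model change is one way to turn the test set into an actual release gate.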

6. Observability, tracing and LLMOps

Once live, you need to see inside the system. LLMOps brings DevOps discipline to language models: tracing every request end to end (retrieved chunks, prompts, tool calls, tokens, latency, cost), logging outputs, monitoring quality and drift, and capturing user feedback so production data flows back into your eval set. Good observability turns a vague "it feels worse this week" into a specific, fixable trace.
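
To make the idea of an end-to-end trace concrete, here is a minimal standard-library sketch. The `Trace` and `span` names, the recorded fields and the per-1,000-token prices are all assumptions; dedicated tracing tools capture the same kind of data with far more detail.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list[dict] = field(default_factory=list)


@contextmanager
def span(trace: Trace, name: str, **attrs):
    """Record one step (retrieval, generation, tool call) with its latency."""
    record = {"name": name, **attrs}
    started = time.perf_counter()
    try:
        yield record   # the caller adds fields such as token counts
    finally:
        record["latency_ms"] = (time.perf_counter() - started) * 1000
        trace.spans.append(record)


def total_cost(trace: Trace, price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Sum token cost over all spans; prices are supplied by the caller."""
    cost = 0.0
    for s in trace.spans:
        cost += s.get("prompt_tokens", 0) / 1000 * price_in_per_1k
        cost += s.get("completion_tokens", 0) / 1000 * price_out_per_1k
    return cost
```

A request would then wrap each step in `with span(trace, "retrieve", chunk_ids=[...]):` so that a quality complaint can be traced back to the exact chunks, prompt and latency involved.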

7. Security and governance

This layer is not optional in Europe. It covers PII detection and redaction, access control on what the model and its retrieval can see, prompt-injection and output guardrails, full audit logs, and alignment with the EU AI Act and GDPR. For Dutch and European organisations, knowing exactly what data your stack touches — and being able to evidence it — is a precondition for production, not a finishing touch.
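
As a first-pass illustration of PII redaction, the sketch below masks e-mail addresses and Dutch IBANs with regular expressions before text is logged or sent to a model. The patterns and placeholder labels are simplifying assumptions; regexes alone miss names, addresses and many other identifiers, so production systems typically add dedicated PII detection on top.

```python
import re

# Deliberately simple patterns, for illustration only.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IBAN": re.compile(r"\bNL\d{2}[A-Z]{4}\d{10}\b"),
}


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace matches with a label and return how many of each were found."""
    counts: dict[str, int] = {}
    for label, pattern in PII_PATTERNS.items():
        text, found = pattern.subn(f"[{label}]", text)
        counts[label] = found
    return text, counts


clean, found = redact("Mail jan@example.nl or pay NL91ABNA0417164300 today")
# clean == "Mail [EMAIL] or pay [IBAN] today"
```

Returning the counts as well as the cleaned text gives the audit log evidence of what was removed without storing the sensitive values themselves.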

RAG vs fine-tuning for enterprise: a clear decision

This is the most common architecture question we hear, and the honest answer is usually "RAG first." Reach for RAG when the problem is knowledge: the model needs facts it never learned (your policies, products, tickets, contracts), especially when that knowledge changes often and you need citations. Reach for fine-tuning when the problem is behaviour: a consistent tone, a strict output format, a specialised classification, or shaving latency and cost by teaching a smaller model a narrow skill it can then do without a long prompt. The two are complementary, not rivals: a mature system often fine-tunes for form and uses RAG for facts. What you should almost never do is fine-tune to inject changing knowledge: it is expensive, it goes stale, and RAG does it better. This trade-off is core to our LLM optimisation practice.

Evaluation, guardrails and cost/latency trade-offs

Three forces pull against each other in every production decision: quality, latency and cost. A bigger model and more retrieved context raise quality but also latency and spend; aggressive caching and a smaller routed model cut cost but can dent quality. There is no universal answer — only the answer your evaluation harness proves for your task. Guardrails sit alongside: input validation against prompt injection, output checks for PII and policy, and human review on high-stakes actions. The teams that win treat these as measurable engineering parameters, tuned with data, rather than fixed beliefs. This evidence-led posture is exactly what we mean by design-first AI.
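
Treating quality, latency and cost as measurable parameters can be as simple as picking the cheapest configuration that still clears your thresholds. The sketch below assumes each candidate configuration has already been scored by your evaluation harness; the `Candidate` fields and threshold names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float               # score from your own evaluation set, 0 to 1
    p95_latency_ms: float        # measured, not estimated
    cost_per_1k_requests: float  # in your billing currency


def cheapest_acceptable(candidates: list[Candidate], min_quality: float,
                        max_latency_ms: float) -> Optional[Candidate]:
    """Return the lowest-cost candidate that meets both thresholds, if any."""
    acceptable = [
        c for c in candidates
        if c.quality >= min_quality and c.p95_latency_ms <= max_latency_ms
    ]
    if not acceptable:
        return None  # nothing qualifies: relax a threshold or improve a candidate
    return min(acceptable, key=lambda c: c.cost_per_1k_requests)
```

The thresholds are the business decision; once they are written down, the choice between configurations becomes a repeatable calculation rather than a debate.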

Build vs buy, and where things are heading

You will not build every layer. The pragmatic pattern is to buy commoditised infrastructure — model APIs, a managed vector store, observability tooling — and build the parts that are your differentiator: your data pipelines, your retrieval logic, your evaluation set and your guardrails. Owning your eval harness and your data layer is what keeps you portable as models and vendors churn. For a wider view of how these patterns are evolving, see where AI is heading in 2026.

How Crux Digits assembles the stack

We are Crux Digits B.V., a Utrecht-based AI consultancy and software studio, and we build this stack in three deliberate steps rather than one big bet. An AI audit and strategy (typically around €2,500) maps your data, use cases and risks and recommends the leanest architecture. A focused proof of concept (around €20,000) ships a working, evaluated slice — real RAG, real metrics — in weeks. From there, production engagements (from €50,000) harden it into a monitored, governed system. Transparent scope lives on our pricing page, illustrative work in our case studies, and you can book a free consultation to map your first use case. The goal is the same every time: not a demo, but a stack you can trust in production.

Frequently asked questions

What is a production AI stack?

A production AI stack is the full set of layers that turn a language model into a dependable system: data ingestion and pipelines, vector storage and retrieval (RAG), the model layer, orchestration, an evaluation harness, observability and LLMOps, and security and governance. The model is the smallest part; most reliability, cost control and compliance live in the surrounding layers.

RAG vs fine-tuning: which should an enterprise use?

Use RAG when the problem is knowledge the model never learned and that changes often — your policies, products or contracts — especially when you need citations. Use fine-tuning when the problem is behaviour: a consistent tone, strict output format or a narrow skill on a smaller, cheaper model. They are complementary, and you should almost never fine-tune to inject changing knowledge, because RAG does it better and stays current.

What is LLMOps and why does it matter?

LLMOps brings DevOps discipline to language models: tracing every request end to end, logging outputs, monitoring quality and drift, controlling cost and latency, and feeding production data and user feedback back into your evaluation set. Because LLM outputs are non-deterministic, observability is what lets you catch regressions and improve a live system with evidence rather than guesswork.

How do I evaluate an LLM application before going live?

Build an evaluation harness: a curated test set of representative inputs and expected behaviours, automated scoring (exact-match where possible, careful LLM-as-judge, and retrieval metrics for RAG), and regression tests that run on every prompt or model change. This turns shipping on vibes into shipping on evidence and protects quality as you iterate.

How does the EU AI Act affect a production AI stack?

For Dutch and European organisations, governance is a precondition for production, not a finishing touch. Your stack needs PII detection and redaction, access control over what the model and its retrieval can see, prompt-injection and output guardrails, and full audit logs — all aligned with the EU AI Act and GDPR. You must be able to evidence exactly what data the system touches and how its outputs are used.
