Train AI on Company Data Under GDPR: Legal?

Short answer: yes, in most cases you can lawfully train AI on your own company and customer data in the Netherlands, but only if you have a valid lawful basis, you stay within the purpose you originally collected the data for (or one compatible with it), and you protect any personal information involved. GDPR (called the AVG in Dutch) does not ban AI training. It governs personal data, no matter what you do with it, and AI training is just another processing activity that has to follow the same rules you already live by. The catch is that most company datasets contain at least some personal data, and "we already have this data" is not, on its own, a legal reason to feed it into a model.

Let me unpack what actually matters, in plain terms, for a business owner or manager who wants to build something useful without stepping on a rake.

First, separate personal data from everything else

Not all company data is regulated. GDPR/AVG only applies to personal data, meaning information that relates to an identifiable living person. A spreadsheet of machine sensor readings, anonymous transaction totals, product specifications, or internal process logs with no names attached generally falls outside GDPR entirely. You can train on that fairly freely, subject to confidentiality and any contracts you signed.

The moment your data includes names, email addresses, customer IDs, support tickets, call transcripts, CVs, or anything that can be traced back to a person, you are in scope. And here is the part people underestimate: data you think is anonymous often is not. A "de-identified" customer record can frequently be re-identified by combining it with other fields. True anonymisation under GDPR is a high bar. Pseudonymised data, where you swap names for codes but keep a key somewhere, is still personal data.
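To make the pseudonymisation point concrete, here is a minimal Python sketch. The key, field names, and records are invented for illustration, and keyed hashing is only one common way to generate codes, not a recommendation for your setup. It shows why swapping IDs for codes doesn't take the data out of scope: anyone who holds the key can recompute the code and find the person's row again.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would live in a secrets manager.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymise(customer_id: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable keyed code for a customer identifier."""
    digest = hmac.new(key, customer_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Hypothetical source records.
records = [
    {"customer_id": "C-1001", "tickets": 3},
    {"customer_id": "C-1002", "tickets": 0},
]

# The training copy carries codes instead of customer IDs.
pseudonymised = [
    {"customer_code": pseudonymise(r["customer_id"]), "tickets": r["tickets"]}
    for r in records
]

# Whoever holds the key can recompute the code for a known customer
# and pick their row back out, so the data is still personal data.
target = pseudonymise("C-1001")
relinked = [r for r in pseudonymised if r["customer_code"] == target]
```

Pseudonymisation is still worth doing as a security measure. It just doesn't change which rules apply.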

So step one is honest data mapping. What's in your dataset, who does it relate to, and how easily could someone be singled out? This is exactly the unglamorous groundwork that makes or breaks a project, and it overlaps heavily with getting your data into shape technically. We've written separately about whether your data is AI-ready and what data engineering for AI actually involves.
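One rough way to probe the "singled out" question during data mapping is to count how many rows share each combination of descriptive fields. The sketch below is a heuristic in the spirit of k-anonymity, not a legal test of anonymisation, and the field names and rows are made up for the example.

```python
from collections import Counter


def group_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def singled_out(rows, quasi_identifiers):
    """Return the rows whose combination of values is unique in the dataset."""
    sizes = group_sizes(rows, quasi_identifiers)
    return [
        row for row in rows
        if sizes[tuple(row[q] for q in quasi_identifiers)] == 1
    ]


# Hypothetical "de-identified" export: no names, but three descriptive fields.
rows = [
    {"postcode": "1012", "birth_year": 1984, "job": "nurse"},
    {"postcode": "1012", "birth_year": 1984, "job": "nurse"},
    {"postcode": "1012", "birth_year": 1991, "job": "pilot"},
    {"postcode": "3511", "birth_year": 1984, "job": "nurse"},
]

at_risk = singled_out(rows, ["postcode", "birth_year", "job"])
print(f"{len(at_risk)} of {len(rows)} rows are unique on these fields")
```

Rows that are unique on a handful of ordinary fields are the ones most likely to be re-identified when combined with other data, so they are a sensible place to start asking questions.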

You need a lawful basis, and "we own the data" isn't one

Under GDPR, every act of processing personal data needs one of six lawful bases. For AI training on existing business data, two come up most often:

Legitimate interest (Article 6(1)(f)). You can process data because you have a genuine business interest, provided that interest isn't overridden by the rights and reasonable expectations of the people involved. This is the workhorse basis for internal analytics and many AI projects. It requires you to actually do and document a balancing test, often called a legitimate interests assessment (LIA).
Consent (Article 6(1)(a)). Sometimes the cleanest route, especially for anything sensitive or unexpected. But consent must be freely given, specific, informed, and unambiguous, and people must be able to withdraw it at any time, which makes it brittle for large historical datasets where you never asked.

Contract, legal obligation, vital interests, and public task are the other four, and they're situational. The key shift in thinking: owning a database does not grant you the right to use it for any purpose. The lawful basis attaches to a purpose, not to the file sitting on your server.

The purpose limitation trap

This is where most "but it's our own data" projects get into trouble. GDPR's purpose limitation principle (Article 5(1)(b)) says you can only use personal data for the specific purpose you collected it for, or a purpose compatible with it.

You collected customer email addresses to deliver orders and provide support. Is training a churn-prediction model a compatible new purpose? Possibly. Is selling a model trained on those records, or using them to build a product you market to others? Much harder to justify, and likely a new purpose that needs its own basis or fresh consent.

The honest test is whether your customers would be reasonably surprised. If a customer would read your AI use and think "that's roughly what I'd expect a company like this to do with my data," you're usually on safer ground. If they'd be startled, treat it as a new purpose. The same logic applies whether you're doing classical machine learning for business or fine-tuning a large language model.

Special-category data raises the bar sharply

Some data is treated as extra-sensitive under Article 9: health, racial or ethnic origin, religious or philosophical beliefs, political opinions, trade-union membership, sex life or sexual orientation, genetic data, and biometric data used to identify someone. Processing these is prohibited by default, with a narrow set of exceptions (usually explicit consent or a specific legal provision), and you still need an ordinary Article 6 lawful basis on top.

If you run a clinic, an HR platform, or anything touching health or biometrics, you cannot simply train on that data because you have it. This is the category where Dutch regulators, and the Autoriteit Persoonsgegevens specifically, pay close attention. Get specialist advice before going anywhere near it.

Watch where the data goes during training

How you train matters as much as what you train on.

If you fine-tune with, or send data to, a third-party model provider, that data leaves your direct control. You need to know: where are their servers, who can see the data, is it used to train their base models, and is there a proper data-processing agreement in place? Transfers outside the EU/EEA carry their own GDPR conditions (Chapter V), such as an adequacy decision or standard contractual clauses.

Pull quote: "We already have this data" is not, on its own, a legal reason to feed it into a model. — Crux Digits

This is one reason a lot of EU businesses prefer architectures that keep data in-house. Retrieval-augmented generation, for instance, often lets you ground a model in your documents without baking that data permanently into model weights, which can be cleaner from a privacy standpoint. If those terms are new, our explainers on what RAG is and RAG versus fine-tuning lay out the trade-offs, and the production AI stack post covers how the pieces fit together in a system you can actually run and audit.

Six rights you have to be able to honour

Whatever you build, the people in your data keep their GDPR rights, and your AI system has to be able to respect them:

Access: they can ask what data you hold and how it's used.
Erasure: the "right to be forgotten." This is genuinely awkward for AI, because deleting someone from your database doesn't remove their influence from an already-trained model. Plan your retraining or data-handling approach so deletion requests are actually deliverable.
Rectification: correcting inaccurate data.
Objection: the right to object to processing based on legitimate interest.
Restriction: pausing processing in certain disputes.
Automated-decision safeguards (Article 22): if the AI makes decisions with legal or similarly significant effects on people (loans, hiring, insurance) without meaningful human involvement, they have the right not to be subject to that decision. Where such decisions are allowed at all, they can ask for human intervention, give their point of view, and contest the outcome, and you must be able to give meaningful information about the logic involved.

That last point is one reason we build with a human in the loop by default. It's not just good practice, it's frequently a legal requirement, and it pairs well with knowing your model's limits, including why models hallucinate and how LLMs actually generate answers.
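On the erasure point in the list above, one piece of groundwork is simply recording whose data went into which training run, so a deletion request tells you which models are affected. The sketch below is one possible shape for that record, not a standard tool. The class, method names, and model versions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class TrainingManifest:
    """Records which data subjects' records were used in each training run."""

    # model version -> IDs of the people whose data was in the training set
    runs: Dict[str, Set[str]] = field(default_factory=dict)

    def record_run(self, model_version: str, subject_ids: Iterable[str]) -> None:
        """Store the set of subject IDs used to train a given model version."""
        self.runs[model_version] = set(subject_ids)

    def models_affected_by(self, subject_id: str) -> List[str]:
        """List model versions that were trained on this person's data."""
        return sorted(
            version for version, ids in self.runs.items() if subject_id in ids
        )


# Hypothetical training history.
manifest = TrainingManifest()
manifest.record_run("churn-v1", ["C-1001", "C-1002"])
manifest.record_run("churn-v2", ["C-1001", "C-1003"])
manifest.record_run("churn-v3", ["C-1003"])

# An erasure request for C-1001 arrives: which models need attention?
to_review = manifest.models_affected_by("C-1001")
```

What you then do with the affected models (retrain, retire, or document why the influence is negligible) is a decision to make with your privacy adviser. The manifest only makes the question answerable.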

A practical sequence before you train anything

If you want a checklist that won't bury you in legal theory, this is roughly the order we work through with clients:

Map the data. What personal data is in there, relating to whom, and how identifiable.
Minimise. Strip out fields the model doesn't need. Less personal data, less risk, often a better model.
Pick and document your lawful basis for the training purpose specifically. Run the balancing test if you're relying on legitimate interest.
Check purpose compatibility. Would your customers expect this use? If not, get a fresh basis.
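Run a DPIA (Data Protection Impact Assessment) when the processing is likely to result in a high risk to people, for example large-scale use of special-category data or systematic profiling with significant effects. Article 35 of the AVG requires it in those cases.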
Lock down the pipeline. Encryption, access controls, processing agreements with any vendors, and clarity on where data physically sits.
Build in the rights. Make sure access, correction, and deletion requests are operationally possible from day one.
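The "Minimise" step in the list above is often the easiest to put into code. A minimal sketch, assuming a churn model and invented field names: use an allow-list, so any new column added to the source system is left out of training unless someone deliberately adds it.

```python
# Hypothetical allow-list for a churn model: only what the model needs.
ALLOWED_FIELDS = {"tenure_months", "plan", "tickets_last_90d", "churned"}


def minimise(record, allowed=ALLOWED_FIELDS):
    """Keep only allow-listed fields; anything else is dropped by default."""
    return {key: value for key, value in record.items() if key in allowed}


# Hypothetical raw customer record.
raw = {
    "name": "J. Jansen",
    "email": "j.jansen@example.com",
    "tenure_months": 18,
    "plan": "pro",
    "tickets_last_90d": 2,
    "churned": False,
}

training_row = minimise(raw)
```

Dropping direct identifiers like this reduces risk but does not by itself make the data anonymous. The remaining fields can still relate to an identifiable person, so the rest of the checklist still applies.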

None of this is exotic. It's the same discipline that makes an AI project work rather than just demo well, which is why solid data handling and solid engineering tend to travel together. It's also why we usually start with a small, scoped piece of work rather than a giant leap, the same logic behind scoping a proof of concept properly.

The EU AI Act sits on top of all this

One more layer worth flagging. GDPR governs the data. The EU AI Act, which entered into force in August 2024 and phases in through 2027 (with most obligations currently scheduled to apply from August 2026), governs the system, with stricter obligations for "high-risk" uses like hiring, credit scoring, and certain biometric applications. The two regimes overlap but aren't the same thing. A use that's fine under GDPR might still carry AI Act obligations, and vice versa. For most ordinary internal-efficiency tools the AI Act is light-touch, but if you're building anything that materially affects people's lives, both regimes apply.

So, can you do it?

For the large majority of Dutch businesses wanting to train or ground AI on their own operational and customer data, the answer is a confident yes, provided you've identified the personal data, chosen a defensible lawful basis, stayed within a compatible purpose, protected the data in transit and at rest, and made the underlying rights deliverable. The illegal version is the lazy one: grabbing whatever's on the server, shipping it to a random model provider, and hoping nobody asks. The lawful version is mostly just good engineering and a bit of paperwork done up front.

If you'd rather not navigate the data-mapping, lawful-basis, and architecture decisions alone, that's the kind of thing we do. Our data engineering and AI consulting work is GDPR-first and human-in-the-loop by design, and a fixed-scope AI Audit & Strategy is usually the right first step to find out exactly where you stand. Feel free to get in touch for a straight answer about your specific situation.

This article is general information, not legal advice. For decisions involving sensitive data or high-risk use cases, consult a qualified data-protection lawyer.

Frequently asked questions

Can I train AI on customer data I already have without asking again?

Sometimes, but not automatically. Already holding the data does not give you a legal basis to use it for a new purpose like AI training. You need a valid lawful basis (often legitimate interest, with a documented balancing test, or fresh consent) and the new use has to be compatible with why you originally collected the data. If customers would be surprised by the use, treat it as a new purpose.

Is anonymised data exempt from GDPR/AVG?

Truly anonymous data, where no individual can be identified even by combining datasets, falls outside GDPR and can be used much more freely. But the bar for real anonymisation is high, and many 'anonymised' datasets can be re-identified. Pseudonymised data (names swapped for codes, with a key kept somewhere) is still personal data and remains fully in scope.

What is the safest lawful basis for AI training on company data?

There is no single safest basis; it depends on the data and purpose. Legitimate interest is the most common workhorse for internal AI and analytics, but it requires a documented balancing test weighing your interest against people's rights. Consent is cleaner for sensitive or unexpected uses but is brittle for large historical datasets. Special-category data (health, biometrics, etc.) usually needs explicit consent or a specific legal exception.

How does the right to be forgotten work if data is already in a trained model?

This is one of the trickier parts of AI under GDPR. Deleting someone from your database does not automatically remove their influence from a model that was already trained on it. You need a plan, such as scheduled retraining, data-handling controls, or architectures like retrieval-augmented generation that avoid baking personal data permanently into model weights, so that erasure requests can actually be honoured.

Do I need a DPIA before training AI on customer data?

Often yes. A Data Protection Impact Assessment is required under Article 35 of the AVG when processing is likely to result in a high risk to people, which typically covers systematic and extensive profiling and large-scale processing of special-category data. Even when not strictly mandatory, running one is good practice because it forces you to document your lawful basis, minimise the data, and identify risks before you build rather than after a complaint.
