Manufacturing Data Engineering for AI: Why the Foundation Comes First
Manufacturing data engineering for AI rarely makes the headlines. Predictive maintenance algorithms, computer vision quality inspection, and real-time production optimisation get the coverage. What gets far less attention is the layer beneath all of those use cases: the pipelines, connectors, data lakes, and transformation logic that move raw machine signals from the shop floor into a form that any AI model can actually use.
Without that foundation, the AI simply does not work. A predictive maintenance model trained on incomplete sensor data will miss faults or fire false alarms. A quality-inspection system fed inconsistent image metadata will struggle to generalise across product variants. A production scheduling optimiser that cannot see real-time machine status will generate plans that conflict with reality within hours of being produced.
This post is about that unglamorous foundation: what it consists of, why it is harder than it looks in a manufacturing environment, and how Dutch factories can build it in a way that is genuinely fit for AI. Crux Digits builds shop-floor data pipelines, SCADA/MES/IIoT integrations, and data lake architectures for manufacturers in the Netherlands and across Europe.
What Data Infrastructure Is Needed to Run AI on a Factory Floor?
This is the central question that manufacturing IT leaders, operations technology managers, and digital transformation leads ask when they begin to take industrial AI seriously. The honest answer involves several layers, and each layer has its own challenges.
Layer 1: Data sources. A modern factory floor generates data from a wide range of systems. Supervisory Control and Data Acquisition (SCADA) systems collect sensor readings, machine states, alarms, and control signals, often at high frequency and sometimes at hundreds of tag updates per second. Manufacturing Execution Systems (MES) capture production orders, work-in-progress tracking, quality check results, and operator inputs. Enterprise Resource Planning (ERP) systems hold materials, orders, recipes, maintenance schedules, and cost data. Industrial Internet of Things (IIoT) devices such as vibration sensors, thermal cameras, flow meters, and smart actuators add further streams of operational telemetry. These sources were typically built at different times, by different vendors, using different protocols and data formats, with no thought given to AI workloads.
Layer 2: Connectivity and integration. Getting data out of these systems reliably is non-trivial. SCADA systems communicate via OPC-UA, OPC-DA, Modbus, Profinet, or proprietary vendor protocols. MES and ERP systems may expose REST or SOAP APIs, database connections, or flat-file exports. IIoT devices may use MQTT, AMQP, or custom cloud connectors. Each connection must be established, secured, and monitored. Connectivity is the first bottleneck that most AI projects hit — the data exists, but extracting it at the right frequency and in the right format requires dedicated engineering work.
Layer 3: Storage and the data lake. Once data is flowing, it needs a home. An industrial IoT data lake for manufacturing must handle high-ingestion-rate time-series data from sensors alongside lower-frequency transactional data from MES and ERP. The architecture needs to support both historical batch queries — training a predictive model on twelve months of vibration data — and real-time or near-real-time queries for live dashboards and inference pipelines. Getting this architecture right from the start avoids painful and expensive refactoring later.
Layer 4: Data quality and transformation. Raw shop-floor data is rarely clean. Sensors drift. Network dropouts create gaps in time-series. Unit conversions are inconsistent across lines or sites. Timestamps may not be synchronised. Alarm floods can overwhelm ingestion pipelines. MES records may have incomplete or incorrectly coded quality outcomes. All of this must be addressed before a model sees the data — through validation rules, imputation strategies, normalisation, and feature engineering pipelines that can be maintained as the data evolves.
Layer 5: Governance and security. OT networks are not IT networks. They were designed for reliability and safety, not for the data-sharing patterns that AI workloads require. Bridging OT and IT — connecting a SCADA historian to a cloud data lake, for example — introduces security risks that must be managed carefully. Data governance must cover who can access which data, how long it is retained, and how it flows between systems and organisational boundaries.
Crux Digits works with manufacturers to scope and build all five layers, through our data engineering and AI implementation services, starting with the layer that is actually the binding constraint for each client.
OT/IT Integration: The Toughest Part of the Shop-Floor Data Pipeline
A shop-floor data pipeline sounds straightforward until you encounter the reality of operational technology (OT) environments. Factory automation systems were designed with a different set of priorities from enterprise IT: continuous availability, deterministic real-time response, and functional safety. The concept of a cloud-connected data pipeline that streams sensor readings to a central lake was simply not part of the original design.
This creates several practical challenges:
- Air-gapped or semi-isolated OT networks. Many factory floors operate on networks that are deliberately isolated from corporate IT and the internet, for security and stability reasons. Getting data out requires carefully designed DMZ (demilitarised zone) architectures or one-way data diodes that preserve the isolation while allowing data to flow to IT systems.
- Legacy equipment with no native connectivity. Older CNC machines, injection moulders, and assembly robots may have no network interface at all, or may communicate only via RS-232 serial or proprietary fieldbuses. Retrofitting connectivity — through edge gateways, protocol converters, or retrofitted sensor kits — is often necessary before any data pipeline can be built.
- Real-time constraints. Some AI use cases — anomaly detection for process control, vibration analysis for bearing fault detection — require near-real-time data with latency measured in milliseconds to seconds, not minutes. Others, such as shift-level production analytics or batch quality reporting, are tolerant of delays. The data architecture must accommodate both, often in the same pipeline.
- Vendor lock-in and proprietary formats. SCADA vendors, PLC manufacturers, and MES providers have historically used proprietary data formats and APIs that make extraction difficult. OPC-UA has improved standardisation significantly for newer equipment, but legacy estates remain heterogeneous.
- Change management and availability. On a factory floor, any change to a live system — even a read-only data tap — must be approved and managed carefully to avoid disrupting production. This slows integration work relative to a typical enterprise IT project.
Crux Digits approaches OT/IT integration with a pragmatic, phased methodology: starting with read-only connections to existing historians and MES exports, validating data quality and coverage, and then progressively adding higher-frequency or real-time streams as confidence grows. This avoids the common mistake of attempting a full integration programme before the AI use cases have been validated.
SCADA Data Integration for AI: What It Actually Involves
Integrating SCADA data with AI platforms is one of the most frequently requested capabilities in industrial AI projects. SCADA systems are the primary source of the continuous machine telemetry (temperature, pressure, flow, speed, vibration, current draw) that predictive maintenance and process optimisation models depend on. Yet connecting a SCADA historian to a modern AI platform involves more than a database query.
Most SCADA historians (OSIsoft PI, now AVEVA PI; Ignition; Wonderware/AVEVA System Platform; Siemens WinCC; or equivalent) store data in a compressed, often vendor-specific time-series format optimised for retrieval, not for bulk export. Extracting data at the frequency needed for AI model training requires careful handling of the historian's query interface, respect for its performance limits, and incremental extraction logic that can resume after interruptions without creating gaps or duplicates.
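The resumable extraction described above can be sketched as a watermark pattern. This is a minimal illustration and not any historian's actual API: `fake_historian`, `upsert`, and the JSON state file are hypothetical stand-ins, and a real query function would wrap whatever interface the historian exposes. It assumes the sink is idempotent (keyed on timestamp and tag), so a window that is re-read after a crash overwrites rather than duplicates.

```python
import json
import tempfile
from datetime import datetime, timedelta, timezone
from pathlib import Path


def load_watermark(path, default):
    """Return the end of the last committed window, or `default` on the first run."""
    if path.exists():
        return datetime.fromisoformat(json.loads(path.read_text())["watermark"])
    return default


def save_watermark(path, watermark):
    """Write to a temp file, then rename, so a crash cannot leave a half-written file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"watermark": watermark.isoformat()}))
    tmp.replace(path)


def extract_incrementally(query, sink, state_path, start, end, window=timedelta(hours=1)):
    """Pull [start, end) in bounded windows, committing progress after each one."""
    cursor = load_watermark(state_path, start)
    while cursor < end:
        upper = min(cursor + window, end)
        rows = list(query(cursor, upper))   # half-open window: no overlap, no gap
        sink(rows)                          # assumed idempotent (see upsert below)
        save_watermark(state_path, upper)   # commit only after the sink succeeded
        cursor = upper
    return cursor


def fake_historian(lo, hi):
    """Hypothetical stand-in for a historian query: one reading per minute in [lo, hi)."""
    t = lo
    while t < hi:
        yield (t, "LINE3.PRESS_01.PV", 1.0)
        t += timedelta(minutes=1)


store = {}


def upsert(rows):
    """Idempotent sink: writing the same (timestamp, tag) twice overwrites, never duplicates."""
    for ts, tag, value in rows:
        store[(ts, tag)] = value


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
state = Path(tempfile.mkdtemp()) / "watermark.json"
extract_incrementally(fake_historian, upsert, state, t0, t0 + timedelta(hours=2))  # first run
extract_incrementally(fake_historian, upsert, state, t0, t0 + timedelta(hours=5))  # resumes at hour 2
print(len(store))  # 300 one-minute readings, each stored once
```

The bounded window keeps each request small, which is one simple way to respect the historian's performance limits. The window size would need tuning per system.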
Once extracted, SCADA data requires substantial transformation before it is model-ready. Tag names are typically engineering identifiers (e.g., LINE3.PRESS_01.PV) that carry no semantic meaning without a tag dictionary. Tags must be mapped to assets, locations, and process variables using the plant's P&ID documentation or asset hierarchy. Time-series gaps must be handled — either interpolated, forward-filled, or flagged as missing depending on the use case. Alarm and event records must be parsed and aligned with the continuous sensor streams they relate to.
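As a rough illustration of the tag mapping and gap handling described above, the sketch below resolves a tag through a small dictionary and resamples a series onto a regular grid. The dictionary entry (including what `LINE3.PRESS_01.PV` means and its unit), the `max_fill` limit, and the three status labels are assumptions made for the example. A real mapping would come from the plant's P&ID documentation or asset hierarchy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TagInfo:
    asset: str
    variable: str
    unit: str


# Hypothetical tag dictionary; the meaning assigned to this tag is illustrative only.
TAG_DICTIONARY = {
    "LINE3.PRESS_01.PV": TagInfo(asset="line3/press_01", variable="pressure", unit="bar"),
}


def resolve_tag(tag):
    """Look up a tag, failing loudly so unmapped tags never reach a model silently."""
    try:
        return TAG_DICTIONARY[tag]
    except KeyError:
        raise KeyError(f"tag {tag!r} is not in the tag dictionary") from None


def fill_gaps(samples, start, end, step, max_fill):
    """Resample sorted (timestamp, value) samples onto a regular grid over [start, end).

    Each grid point takes the latest sample at or before it. Points are marked
    'observed' (exact sample), 'filled' (forward-filled within max_fill), or
    'missing' (no sample recent enough; the value is None).
    """
    out, i, last, t = [], 0, None, start
    while t < end:
        while i < len(samples) and samples[i][0] <= t:
            last = samples[i]
            i += 1
        if last is not None and t - last[0] <= max_fill:
            status = "observed" if last[0] == t else "filled"
            out.append((t, last[1], status))
        else:
            out.append((t, None, "missing"))
        t += step
    return out


t0 = datetime(2024, 1, 1, 8, 0)
raw = [(t0, 1.0), (t0 + timedelta(minutes=1), 1.1), (t0 + timedelta(minutes=5), 1.5)]
grid = fill_gaps(raw, t0, t0 + timedelta(minutes=6),
                 step=timedelta(minutes=1), max_fill=timedelta(minutes=2))
for ts, value, status in grid:
    print(ts.time(), value, status)
```

Keeping the status column alongside the value lets each downstream use case decide whether filled points are acceptable, rather than baking that choice into the pipeline.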
The output of this pipeline is a cleaned, labelled, asset-linked time-series dataset that a machine learning model can be trained on. The pipeline itself must be maintainable: when new equipment is added, when tag naming conventions change, or when the SCADA system is upgraded, the integration must adapt without manual rework on every downstream model.
MES and ERP Data Integration for AI: Connecting the Production Record
Integrating MES and ERP data adds the production context that machine telemetry alone cannot provide. Sensor data tells you what the machine was doing; MES data tells you what it was supposed to be doing, what product variant was running, who was operating it, and what the quality result was. ERP data adds the material lot, the customer order, the recipe version, and the maintenance history.
Without this context, AI models trained purely on machine signals are limited. A vibration anomaly means different things depending on whether the machine was running at 60% speed on a prototype batch or at 100% speed on a high-volume production run. A quality defect rate spike means different things depending on whether raw material lot A or lot B was in use. The richest AI models for manufacturing combine telemetry with production context — and that requires integrating MES and ERP data into the same platform as the sensor data.
MES integration typically involves connecting to the MES via its API or database interface to extract work-order records, quality checkpoint results, and operator event logs. ERP integration may involve SAP, Microsoft Dynamics, or a sector-specific ERP system, each with its own integration patterns. The key engineering challenge is joining these records to the sensor time-series on the correct temporal keys — matching the machine state during production order X to the sensor readings during that production window, accounting for clock skew between systems.
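The temporal join can be sketched as follows, under the assumption that the clock offset between the MES and SCADA has already been measured. The `WorkOrder` type, the `mes_offset` parameter, and the optional `guard` margin are names invented for this example, not features of any particular MES.

```python
from bisect import bisect_left
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class WorkOrder:
    order_id: str
    start: datetime  # as recorded on the MES clock
    end: datetime    # as recorded on the MES clock


def label_readings(readings, orders, mes_offset, guard=timedelta(0)):
    """Attach an order id to each sensor reading that falls inside an order window.

    readings:   list of (timestamp, value) sorted by timestamp, on the SCADA clock.
    mes_offset: measured (MES clock - SCADA clock); subtracting it moves MES
                timestamps onto the SCADA clock.
    guard:      margin trimmed from both ends of each window, for cases where
                the residual skew is uncertain.
    """
    times = [ts for ts, _ in readings]
    labelled = []
    for order in orders:
        lo = order.start - mes_offset + guard
        hi = order.end - mes_offset - guard
        if lo >= hi:
            continue  # window vanished after trimming
        i, j = bisect_left(times, lo), bisect_left(times, hi)  # half-open [lo, hi)
        labelled.extend((ts, value, order.order_id) for ts, value in readings[i:j])
    return labelled


t0 = datetime(2024, 1, 1, 8, 0)
readings = [(t0 + timedelta(minutes=m), float(m)) for m in range(10)]
orders = [WorkOrder("WO-1001", t0 + timedelta(minutes=4), t0 + timedelta(minutes=8))]

# Assume the MES clock runs two minutes ahead of the SCADA clock.
labelled = label_readings(readings, orders, mes_offset=timedelta(minutes=2))
print([value for _, value, _ in labelled])  # [2.0, 3.0, 4.0, 5.0]
```

Without the offset correction the same order would be matched to readings 4.0 to 7.0, so half of the labelled window would belong to a different production state.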
Crux Digits has built MES/ERP integration layers for manufacturers using SAP S/4HANA, Oracle Cloud ERP, and sector-specific MES platforms. Browse our case studies for examples. Our broader manufacturing practice covers the full stack from shop-floor connectivity to AI model deployment.
Real-Time Analytics for Manufacturing AI: Batch vs Stream Processing
One of the most important architectural decisions in a real-time analytics platform for manufacturing AI is where to draw the line between batch and stream processing. Getting this decision right has significant implications for cost, complexity, and the AI use cases that become possible.
Batch processing collects data over a defined interval and processes it together — hourly, daily, or per-shift. It is simpler to build and maintain, works well for historical model training, and suits use cases where the value of the insight does not depend on immediacy: shift-level OEE reporting, weekly predictive maintenance risk scoring, or monthly quality trend analysis. Most manufacturers can achieve significant AI value with batch-first architectures, especially in early-stage deployments.
Stream processing handles data continuously as it arrives, enabling insights with latency measured in seconds rather than hours. It is required for use cases where timely action matters: detecting a bearing anomaly before it causes unplanned downtime, alerting an operator to a process parameter drift before it results in out-of-spec product, or dynamically adjusting machine settings in response to real-time quality feedback. Stream processing is significantly more complex to build, operate, and debug than batch processing. It should be adopted when the use case genuinely requires it, not as a default architectural preference.
A pragmatic approach for most Dutch manufacturers is to begin with a batch-capable data lake architecture that can be extended with stream processing for specific high-value use cases. This avoids over-engineering the foundation before the AI use cases have been validated, while preserving the option to add real-time capabilities as the programme matures.
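One way to keep a batch-first design open to streaming later is to write the scoring logic as a small stateful object that consumes one reading at a time. The sketch below uses a rolling z-score purely as a placeholder technique. The class name, window size, and threshold are assumptions, and a real bearing-fault detector would be considerably more sophisticated.

```python
import math
from collections import deque


class RollingZScore:
    """Flag a reading that deviates strongly from the previous `window` readings."""

    def __init__(self, window, threshold=3.0):
        self.values = deque(maxlen=window)
        self.threshold = threshold

    def update(self, x):
        """Score x against the current window, then add x to the window."""
        is_anomaly = False
        if len(self.values) == self.values.maxlen:  # only score once the window is full
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = math.sqrt(var)
            is_anomaly = std > 0 and abs(x - mean) / std > self.threshold
        self.values.append(x)  # simplification: anomalous values also enter the window
        return is_anomaly


def score_batch(history, window, threshold=3.0):
    """Batch mode: replay stored readings through the same detector."""
    detector = RollingZScore(window, threshold)
    return [detector.update(x) for x in history]


history = [10.0, 10.1, 9.9, 10.0, 10.1, 9.9, 10.0, 15.0, 10.0]
flags = score_batch(history, window=5)
print(flags.index(True))  # 7, the position of the 15.0 spike

# Stream mode: the same object, fed one message at a time as readings arrive.
live = RollingZScore(window=5)
for reading in history:
    if live.update(reading):
        print("anomaly:", reading)
```

Because both modes share `update`, a model validated on historical batches can later be attached to a stream consumer without rewriting the scoring logic.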
Technologies commonly used in industrial data platforms include Apache Kafka or MQTT brokers for stream ingestion, Delta Lake or Apache Iceberg for time-series storage with ACID guarantees, Apache Spark or dbt for batch transformation, and Databricks, Microsoft Fabric, or open-source equivalents for the broader platform. Crux Digits is vendor-neutral: we select the technology stack that fits the client's existing infrastructure, skills, and budget, not the stack that a particular vendor is promoting.
Data Quality: The Silent Killer of Manufacturing AI Projects
Most manufacturing AI projects that fail do not fail because the algorithm was wrong. They fail because the data was not good enough — or because nobody checked before the model was trained.
Common data quality issues in manufacturing environments include:
- Sensor drift and calibration gaps. A temperature sensor that has drifted by 3°C over six months produces systematically biased training data. If the drift is not identified and corrected, the model will learn the wrong baseline.
- Missing data and dropouts. Network interruptions, PLC resets, and historian purges create gaps in time-series data. How those gaps are handled — interpolation, forward-fill, flagging, or exclusion — must be an explicit engineering decision, not an accidental default.
- Label quality for supervised learning. Predictive maintenance models need labelled examples of failure events. If maintenance records in the CMMS are incomplete, inconsistently coded, or entered retroactively, the labels will be unreliable. Poor labels produce poorly calibrated models regardless of how good the sensor data is.
- Clock synchronisation across systems. SCADA, MES, and ERP systems may have clocks that differ by minutes or more. When joining records across systems on timestamps, even small clock skew can cause temporal misalignment that corrupts the training dataset.
- Inconsistent units and naming. Pressure measured in bar in one line and PSI in another; speed measured in RPM in SCADA and as a percentage of setpoint in the MES. These inconsistencies must be resolved explicitly in the transformation layer, not left to the model to figure out.
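Resolving units explicitly in the transformation layer can be as simple as a conversion table that fails loudly on anything unregistered. The table below is a minimal sketch. The choice of bar and degrees Celsius as canonical units, and the helper names, are assumptions for illustration.

```python
# Maps a source unit to (canonical_unit, scale, offset): canonical = value * scale + offset.
CONVERSIONS = {
    "bar": ("bar", 1.0, 0.0),
    "psi": ("bar", 0.0689476, 0.0),        # 1 psi is about 0.0689476 bar
    "kpa": ("bar", 0.01, 0.0),
    "degc": ("degC", 1.0, 0.0),
    "degf": ("degC", 5.0 / 9.0, -160.0 / 9.0),  # (F - 32) * 5/9
}


def to_canonical(value, unit):
    """Convert a reading to its canonical unit, rejecting unknown units outright."""
    key = unit.strip().lower()
    if key not in CONVERSIONS:
        raise ValueError(f"no conversion registered for unit {unit!r}")
    canonical_unit, scale, offset = CONVERSIONS[key]
    return value * scale + offset, canonical_unit


def percent_of_setpoint_to_rpm(percent, setpoint_rpm):
    """Turn an MES-style 'percent of setpoint' speed into RPM, given the setpoint."""
    return setpoint_rpm * percent / 100.0


print(to_canonical(100.0, "PSI"))              # roughly 6.89 bar
print(to_canonical(212.0, "degF"))             # roughly 100 degC
print(percent_of_setpoint_to_rpm(60.0, 1500))  # 900.0
```

Raising on an unknown unit is deliberate: a silent pass-through is how mixed bar and PSI readings end up in the same training column.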
Crux Digits includes a data quality assessment as a standard deliverable in every manufacturing AI engagement. We instrument the pipeline with automated quality checks, generate data quality reports before model training begins, and help clients establish ongoing monitoring so that data quality issues are detected operationally rather than discovered when model performance degrades. Our machine learning services are built on the assumption that the data engineering must be right before the modelling begins.
OT Network Security: The Non-Negotiable Constraint
Connecting shop-floor systems to a data platform — whether on-premises or cloud-based — changes the security posture of the OT environment. This is not a reason to avoid the connection; it is a reason to design it carefully.
Key security principles for shop-floor data integration:
- Read-only by design. Data pipelines from OT systems should be read-only wherever possible. A data tap that can only read from the SCADA historian cannot write commands to PLCs, even if compromised. This architectural principle limits the blast radius of any security incident.
- Network segmentation. OT networks should remain segmented from corporate IT and the internet. Data should flow from OT to IT via a controlled interface — a data diode, a historian-to-cloud connector, or an edge gateway sitting in a DMZ — not through a direct connection between the plant floor network and the corporate WAN.
- Encrypted transport. All data in transit between OT systems and the data platform should be encrypted, even within the plant. TLS or equivalent should be standard, not optional.
- Minimal credentials. Integration accounts used to extract data from SCADA or MES should have the minimum permissions needed for the specific query, not broad database or system-administrator rights.
- Monitoring and alerting. Unusual data flows — unexpected query volumes, connections from new source IPs, schema changes — should be monitored and alerted on. OT environments have historically lacked this kind of observability; adding it as part of the data integration project is a worthwhile investment.
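The monitoring principle above can be illustrated with a small check over a query audit log. This is a sketch under stated assumptions: the `QueryEvent` fields, the allow-list of source addresses, and the per-account row budget are all hypothetical, and a production setup would more likely feed such events into an existing SIEM or monitoring stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryEvent:
    source_ip: str
    account: str
    rows_returned: int


def find_alerts(events, known_sources, max_rows_per_account):
    """Return alert messages for unknown source IPs and unusually large extractions."""
    alerts = []
    seen_unknown = set()
    totals = {}
    for event in events:
        if event.source_ip not in known_sources and event.source_ip not in seen_unknown:
            seen_unknown.add(event.source_ip)  # report each unknown address once
            alerts.append(f"new source IP {event.source_ip} (account {event.account})")
        totals[event.account] = totals.get(event.account, 0) + event.rows_returned
    for account, total in sorted(totals.items()):
        if total > max_rows_per_account:
            alerts.append(f"account {account} read {total} rows, budget {max_rows_per_account}")
    return alerts


events = [
    QueryEvent("10.0.5.10", "svc_historian_ro", 40_000),
    QueryEvent("10.0.5.10", "svc_historian_ro", 45_000),
    QueryEvent("10.0.9.77", "svc_historian_ro", 30_000),
    QueryEvent("10.0.5.11", "svc_mes_ro", 2_000),
]
for alert in find_alerts(events, known_sources={"10.0.5.10", "10.0.5.11"},
                         max_rows_per_account=100_000):
    print(alert)
```

In this example both rules fire: the address 10.0.9.77 is not on the allow-list, and the historian account has read 115,000 rows against a budget of 100,000.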
The NIS2 Directive, which entered into force in January 2023 and which EU member states were required to transpose into national law by 17 October 2024, has increased regulatory obligations for cybersecurity in manufacturing and critical infrastructure. Dutch manufacturers subject to NIS2 must implement appropriate security measures for their OT environments. Crux Digits designs data integration architectures that comply with these requirements from the outset, rather than adding security as an afterthought.
Starting Focused: The Right Way to Begin a Factory Data Programme
One of the most common mistakes in manufacturing data engineering is trying to build the entire platform before validating a single AI use case. The ambition is understandable — if you are going to invest in a data lake, you want it to serve every use case — but the execution risk is high. Large platform programmes that are decoupled from specific business outcomes tend to stall, run over budget, or deliver infrastructure that does not quite match the actual needs of the AI use cases that eventually emerge.
A better approach is to start focused:
- Identify the one or two AI use cases where the business value is clearest and the data sources are most tractable. Predictive maintenance on a specific critical asset, or real-time quality monitoring on a specific line, are typical starting points.
- Build the minimum data infrastructure needed to prove that use case: connect the relevant data sources, implement the necessary transformations, validate data quality, and train a baseline model.
- Measure the business impact of the first use case — avoided downtime, reduced scrap, operator time saved — and use that evidence to justify the next phase of platform investment.
- Design the initial architecture to be extensible: choose storage formats, transformation patterns, and governance approaches that will scale, even if the first deployment is modest in scope.
- Build in data quality monitoring from day one, so that the discipline of measuring and improving data quality is established before the portfolio of AI models grows.
This focused approach produces faster time-to-value, lower initial risk, and a clearer link between data engineering investment and business outcomes. It also tends to build the internal capability and confidence that makes subsequent phases easier to execute.
Crux Digits offers scoping engagements specifically designed for manufacturers who are at the beginning of this journey. We map the data landscape, identify the highest-value use cases, assess data readiness, and produce a prioritised roadmap. Review our pricing page for guidance on engagement structures, or get in touch directly to discuss your situation.
The Manufacturing AI Data Engineering Checklist
Before investing in AI models for the factory floor, work through the following foundation questions:
- Which data sources are relevant for the target AI use case — SCADA, MES, ERP, IIoT, CMMS — and are they accessible?
- What protocols do those sources use, and what integration work is required to connect them?
- Is the OT network architected to allow data to flow to an IT or cloud platform securely?
- What is the historical data retention in the SCADA historian, and is it sufficient for model training?
- How complete and consistently coded are the maintenance and quality records that will be used as labels?
- Are timestamps synchronised across all relevant systems?
- Is there a team — internal or external — with OT/IT integration experience, not just general software engineering?
- Is the target use case genuinely better served by real-time streaming data, or will batch suffice?
- What data governance and security requirements apply to shop-floor data in your organisation and sector?
- How will data quality be monitored operationally, not just validated once at project start?
How Crux Digits Builds Shop-Floor Data Foundations
Crux Digits is a vendor-neutral AI consultancy based in Utrecht, working with manufacturers across the Netherlands and the wider EU. We do not sell a proprietary IIoT platform or a pre-packaged MES connector. We design and build the data infrastructure that is the right fit for each client's specific equipment estate, OT environment, AI use cases, and organisational constraints.
Engagements typically begin with a data readiness assessment: we inventory the relevant data sources, map their protocols and access methods, sample the data to assess quality, and identify the gaps that need to be addressed before AI model training can begin. From that assessment we produce a prioritised architecture recommendation and a phased implementation plan.
The build phase covers OT/IT connectivity (SCADA historian taps, MES API integrations, IIoT broker configuration), data lake architecture and storage layer setup, transformation and feature engineering pipelines, data quality instrumentation, and security and governance controls. We deliver documented, maintainable pipelines — not one-off scripts that only the original developer can modify.
Once the data foundation is in place, our AI implementation team builds and deploys the models that depend on it. Our machine learning practice covers predictive maintenance, quality inspection, process optimisation, and production scheduling. We also offer data engineering as a standalone service for manufacturers who want to build the platform foundation independently before engaging on AI.
The result is a shop-floor data platform that your AI models can actually rely on — and that your operational technology team can maintain and extend as your manufacturing processes evolve.
Frequently Asked Questions
How long does it take to build a shop-floor data pipeline ready for AI?
It depends on the complexity of the data sources, the state of OT/IT connectivity, and the target AI use case. A focused pipeline connecting a SCADA historian and an MES for a single predictive maintenance use case can be scoped, built, and validated in eight to sixteen weeks. A broader platform covering multiple sites, real-time streams, and several AI use cases will take longer and is best delivered in phases. Crux Digits starts every engagement with a data readiness assessment to give an accurate estimate before any build commitment is made.
Can existing SCADA and MES systems be connected to a cloud data lake without disrupting production?
Yes, when done carefully. The standard approach uses read-only connections to existing SCADA historians and MES exports — tapping into data that is already being generated, without writing to or modifying live control systems. Changes to OT environments require approval through the plant change management process, but a well-designed read-only data tap can typically be implemented without any production downtime. Network architecture — DMZ, data diodes, encrypted transport — must be designed to preserve the isolation of the OT network throughout.
What is the difference between a SCADA historian and a modern industrial data lake?
A SCADA historian is purpose-built for storing and retrieving time-series process data efficiently, typically using proprietary compression and storage formats optimised for the SCADA vendor's own query tools. A modern industrial data lake is a general-purpose, scalable storage platform — typically built on open formats such as Delta Lake or Apache Iceberg — that can hold SCADA time-series data alongside MES records, ERP data, image data from vision systems, and any other structured or unstructured data relevant to AI workloads. The data lake is queryable by a wide range of tools and supports the bulk data access patterns that machine learning model training requires, which most SCADA historians were not designed to handle efficiently.
Does a factory need cloud infrastructure to run AI, or can it be done on-premises?
Both architectures are viable. Cloud-based data lakes and AI platforms offer scalability, managed services, and lower upfront infrastructure investment, but require the OT data to leave the plant network — which introduces security and data governance considerations that must be managed. On-premises or edge-based architectures keep data within the plant and can support real-time inference with very low latency, but require capital investment in servers and ongoing maintenance. Hybrid approaches — where raw data stays on-premises but processed features or aggregated datasets are moved to the cloud for model training — are increasingly common. Crux Digits designs the architecture that best matches the client's security requirements, network constraints, and budget.
How does Crux Digits approach data quality for manufacturing AI projects?
Data quality assessment is a standard deliverable in every manufacturing AI engagement Crux Digits undertakes. Before any model training begins, we instrument the data pipeline with automated validation checks — completeness, consistency, timestamp synchronisation, unit normalisation — and produce a data quality report that identifies gaps and the actions needed to address them. We then help clients establish ongoing monitoring so that data quality is tracked operationally, not just validated once at project start. This approach ensures that models are trained on data that accurately reflects the process, and that data quality issues are caught early rather than discovered when model performance unexpectedly degrades in production.