8 data integration challenges and how to overcome them
Data integration in modern architectures faces eight challenges, from preserving lineage at scale to serving AI and analytics workloads, each explored with practical strategies.
Every so often, I meet a CIO who tells me their team has consolidated numerous legacy data sources into a single new platform, and its worst data integration challenges will now be a thing of the past. Unfortunately, the reality is quite different.
More often than not, much of the organization's data still resides in disparate systems -- on desktops, in file shares, in streams from various devices and in external data collected from the web. The need for complicated data integration never went away.
In fact, almost every organization, regardless of size, needs to integrate data from multiple sources to support business processes and get a consistent view of their current state, customer behavior and ongoing operations. It's not an easy task.
The first step in tackling data integration is to understand how the process fits into your overall data management strategy. For example, an ERP system might be governed and administered by a set of data management policies focused on financial integrity. Meanwhile, the CRM system is governed by the need to comply with customer data privacy regulations. What policies will govern a process that integrates both systems and creates a new, consolidated data set for BI and analytics uses?
Data integration is a vital part of data management, but it comes with plenty of technical challenges. The eight described here are common, especially in modern data architectures that are increasingly diverse and dynamic.
1. Managing data volumes while preserving lineage
Most businesses today recognize data as a valuable asset, but many struggle with the sheer scale of what is available to them. Data storage is relatively cheap, and analytics tools are capable of handling large amounts of data, so where does the problem lie?
Data integration and related disciplines, such as managing data quality, can be a real challenge when data volumes become massive. To add to the complication, regulators increasingly demand audit trails for all data used in decision support or for training AI systems. This is the problem of data lineage, where the path data takes through systems and transformations must be traced. Lineage answers questions like: where did this data come from, what happened to it along the way, and where did it go?
Many data integration tasks that are routinely performed with modest data volumes become more taxing as workloads grow very large. But most of the techniques for handling complex integration processes will also help with large volumes of data. In addition, running smaller, more efficient batch integration jobs and optimizing the integration workflow prevent the pipeline from being held up. But the problem of maintaining lineage through these transformations remains.
One approach is to embed lineage into the architecture from the start: design the smaller batch jobs and optimized workflows so that each step maintains metadata about what it read, what it changed and what it wrote. Where lineage becomes too complex to maintain, it may be acceptable to track data provenance instead. Provenance is closely related to lineage but focuses more on where the data originated, who created it and whether we can trust it.
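As a minimal sketch of lineage built into a batch step, the Python below wraps each small transformation so that it appends a record of its inputs, output and row counts to a log. The names `LineageStep`, `run_step`, `erp.orders` and `staging.valid_orders` are hypothetical and chosen only for illustration; a real platform would more likely write this metadata to a catalog than to an in-memory list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageStep:
    """One hop in a data set's history: what ran, on what, producing what."""
    step: str
    inputs: list
    output: str
    rows_in: int
    rows_out: int
    ran_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def run_step(name, func, rows, inputs, output, log):
    """Run one small batch transformation and append its lineage to `log`."""
    result = func(rows)
    log.append(LineageStep(name, list(inputs), output, len(rows), len(result)))
    return result


lineage_log = []
orders = [{"id": 1, "amount": 120.0}, {"id": 2, "amount": None}]

# Hypothetical step: keep only orders that have an amount.
valid = run_step(
    "drop_missing_amounts",
    lambda rs: [r for r in rs if r["amount"] is not None],
    orders,
    inputs=["erp.orders"],
    output="staging.valid_orders",
    log=lineage_log,
)

for entry in lineage_log:
    print(entry.step, entry.inputs, "->", entry.output,
          f"({entry.rows_in} rows in, {entry.rows_out} rows out)")
```

Because the record is written by the same call that runs the transformation, the log can answer where the data came from, what happened to it and where it went without a separate documentation effort.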
2. Integrating diverse data sources
At one time, extract, transform and load (ETL) and other data integration processes mostly involved a mix of text files and database extracts that contained familiar structured data types. Now, you might also need to integrate streaming data from device logs and online services, including:
- Social media data, including text and images.
- Public data, such as weather data or commodity prices, from government websites or specialized information providers.
- In some sectors, increasing amounts of ecosystem data from customers, partners, suppliers or shippers.
This data diversity complicates integration work, but you can manage it with a careful choice of data platforms and tool sets. A traditional data warehouse built on a relational database handles structured diversity reasonably well, but struggles with data streams and unstructured content. A data lake processes streams and unstructured data effectively, but historically offered weaker guarantees around integrity, transaction support and availability.
Recently, the hybrid data lakehouse architecture has become a critical component in the data ecosystem to support analytics, real-time and AI scenarios. A lakehouse adds transactional capabilities and schemas over data lake storage, often through an open table format such as Apache Iceberg. In this way, a single platform can now serve BI workloads that require strict consistency, data science workloads that require access to raw, fine-grained data and AI workloads that may include very diverse data types.
This convergence simplifies governance because rather than maintaining separate policies for warehouse and lake environments, organizations can enforce access controls, lineage tracking and quality metrics through a unified layer.
3. Hybrid cloud and on-premises environments
Not all large-scale IT workloads have moved to the cloud, and some that did have been repatriated to on-prem storage. Several factors drive this, including the following:
- Cloud systems are flexible and easy to scale up or down, but they aren't cheap. Many enterprises have been surprised by the cost of cloud computing -- they find that the elastic nature of cloud services and the simplicity of scaling lead to inefficient practices that push usage costs far higher than expected.
- There are increasing regulatory demands for data sovereignty, which refers not only to the physical storage of data, but also to AI models trained over it and who (or what systems) can access those assets. This trend is driven by privacy concerns, but also by worries about the transfer of AI technologies across national boundaries.
- Organizations training their own AI models are often concerned about data exposure, especially when using cloud-hosted services. They may prefer to run smaller models on premises to keep proprietary data and trade secrets under their control.
4. Poor or inconsistent data quality
I often say there's only one measurement of data quality: Is it fit for purpose? But the same data can be used for very different purposes, which makes assessing its quality in that way tricky. Also, key dimensions of data quality, such as accuracy, timeliness, completeness and especially consistency, become more difficult to maintain as data volumes grow and data comes from diverse sources.
That could mean business decisions are based on bad, incomplete or duplicate data. To prevent that, organizations must identify and address data quality issues during data integration, through steps such as data profiling and data cleansing. In a data lakehouse, however, it's advised not to run destructive data integration processes that overwrite or discard the original data, which may be of analytical value to data scientists and other users as is. Rather, ensure the raw data is still available in a separate architectural zone and document quality metrics for each zone so that downstream consumers understand what they are working with.
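A nondestructive cleansing step with per-zone quality metrics might look something like the following sketch. It assumes rows are plain dictionaries and that "quality" means only completeness of required fields and exact duplicates; the `profile` and `cleanse` functions and the zone names are hypothetical, not part of any particular lakehouse product.

```python
def profile(rows, required):
    """Return simple quality metrics: row count, completeness, duplicates."""
    total = len(rows)
    complete = sum(
        all(r.get(k) not in (None, "") for k in required) for r in rows
    )
    distinct = len({tuple(sorted(r.items())) for r in rows})
    return {
        "rows": total,
        "completeness": complete / total if total else 1.0,
        "duplicates": total - distinct,
    }


def cleanse(rows, required):
    """Build a NEW cleansed list; the raw rows are never modified."""
    seen, out = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen or any(r.get(k) in (None, "") for k in required):
            continue
        seen.add(key)
        out.append(dict(r))
    return out


raw_zone = [
    {"customer": "a@example.com", "country": "US"},
    {"customer": "a@example.com", "country": "US"},   # duplicate
    {"customer": "b@example.com", "country": None},   # incomplete
]
required = ["customer", "country"]

cleansed_zone = cleanse(raw_zone, required)

# Document quality per zone so consumers know what they are working with.
metrics = {
    "raw": profile(raw_zone, required),
    "cleansed": profile(cleansed_zone, required),
}
print(metrics)
```

The raw zone keeps all three rows, including the duplicate and the incomplete record, so a data scientist who wants to study missing values can still do so.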
5. Serving multiple use cases without sacrificing compliance
It's not only data quality that is affected by multiple use cases -- data integration is too. Data scientists often want to build analytical models with fine-grained raw data so they don't miss potential insights, but business analysts more often want to work with aggregated data that's aligned with their best practices. Similarly, data may need to be in different formats to be consumed by different tools. For example, data science tools might optimally use Apache Parquet files, but BI dashboards need to be built through a database connection.
AI workloads demand specific data types and access patterns, including large volumes of labeled data for training, low-latency feature data for inference, and vector embeddings. For these AI workloads, tracking data lineage is crucial to address regulatory constraints on model learning and to ensure data quality. These demands introduce new integration points, in addition to traditional analytics and BI needs, often requiring distinct pipelines for data transformation and delivery.
The secret is not only to understand these different use cases, but also to accommodate them. Don't try to force users to change data platforms or analytics tools or to use a suboptimal connection just to make data integration easier. Instead, look for a data integration platform that supports a wide variety of targets and don't be reluctant to build separate data pipelines for specific use cases.
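To illustrate separate pipelines for separate consumers, the sketch below delivers the same source rows to two targets: a fine-grained file for data science and an aggregated table behind a database connection for BI. To stay within the Python standard library it substitutes CSV text for Parquet and an in-memory SQLite database for a warehouse; both substitutions, and all function and table names, are assumptions made for the example.

```python
import csv
import io
import sqlite3

raw_sales = [
    {"region": "EMEA", "sku": "A1", "amount": 40.0},
    {"region": "EMEA", "sku": "B2", "amount": 60.0},
    {"region": "APAC", "sku": "A1", "amount": 25.0},
]


def deliver_raw_file(rows):
    """Data science target: every fine-grained row, as file content."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["region", "sku", "amount"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def deliver_bi_table(rows, conn):
    """BI target: an aggregated table reachable over a database connection."""
    conn.execute(
        "CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, total REAL)"
    )
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    conn.executemany(
        "INSERT INTO sales_by_region VALUES (?, ?)", totals.items()
    )
    conn.commit()


raw_file = deliver_raw_file(raw_sales)

bi_conn = sqlite3.connect(":memory:")
deliver_bi_table(raw_sales, bi_conn)
bi_rows = bi_conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(bi_rows)
```

Neither consumer has to adapt to the other's format, at the cost of maintaining two small delivery functions instead of one.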
6. Monitoring and observability
In a data integration context, it's easy to mistake data observability for a more traditional approach that involves careful logging and reporting of integration processes. However, there's more to it than that.
Modern data integration processes -- diverse, distributed and dynamic as they are -- also have numerous points of failure. Very often, data pipelines and the dependencies between them are designed so that no single failure breaks the whole process. This robustness reduces downtime, which used to be the bane of overnight ETL processes that broke too easily and were difficult to restart.
However, if the process as a whole doesn't break but one part of it fails, what is the current state of your data? Monitoring can be a challenge.
This is where data observability comes in. It measures data delivery -- including when the data was last fully processed -- logs all processes and traces both the source and impact of any errors. When you run a report, you can see not only when the data was refreshed, but also which components -- even which rows or cells -- may not be fully up to date or of the best quality.
Observability typically focuses on the following five key attributes of data:
- Is it timely?
- Is it structured as we expect?
- Is it within the expected data quality limits?
- Is the data complete?
- What is the lineage of the data -- where was it sourced from?
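The five questions above can be sketched as five checks on a delivered batch. This is a simplified illustration, not how any specific observability tool works: the batch layout (`source`, `loaded_at`, `rows`), the `observe` function and every threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def observe(batch, now, max_age, expected_columns, amount_range, min_rows):
    """Answer the five questions for one delivered batch (a dict)."""
    rows = batch["rows"]
    low, high = amount_range
    return {
        # Is it timely?
        "timely": now - batch["loaded_at"] <= max_age,
        # Is it structured as we expect?
        "schema_ok": all(set(r) == expected_columns for r in rows),
        # Is it within the expected data quality limits?
        "quality_ok": all(
            isinstance(r.get("amount"), (int, float))
            and low <= r["amount"] <= high
            for r in rows
        ),
        # Is the data complete?
        "complete": len(rows) >= min_rows,
        # Do we know where it was sourced from?
        "lineage_known": bool(batch.get("source")),
    }


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # arbitrary fixed time
batch = {
    "source": "crm.export",
    "loaded_at": now - timedelta(hours=3),
    "rows": [{"id": 1, "amount": 50.0}, {"id": 2, "amount": -5.0}],
}

report = observe(
    batch,
    now,
    max_age=timedelta(hours=1),
    expected_columns={"id", "amount"},
    amount_range=(0.0, 10_000.0),
    min_rows=2,
)
print(report)
```

Here the batch arrives with the right shape and row count, but the report flags it as stale and as containing a value outside the expected range, which is exactly the kind of partial failure that plain process logging tends to miss.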
In a perfect world, all of our data would be right all the time. In the real world, pay attention to data observability and, if needed, invest in specific tools for this work.
7. Integrating streaming data with event-driven architectures
A stream of data is continuous and unbounded -- there's no beginning or end, just a sequence of recorded events. And therein lies the challenge: How do you integrate this open-ended data with more traditional architectures that expect data sets, not streams?
With many devices or streaming services, it's possible to request data in a batch. You could get all the data for the last five minutes, for example. What you receive is a file, structured much like an XML record or a database table. You could then get another one five minutes later, building up a continuous flow of data. This isn't really streaming -- in fact, many call this micro-batching because you're receiving numerous small, discrete batches or data sets.
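As a rough sketch of micro-batching, the function below groups timestamped events into five-minute windows, each of which can then be handled like an ordinary small data set. It assumes events arrive as `(epoch_seconds, payload)` pairs; that shape and the name `micro_batches` are illustrative only.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60


def micro_batches(events):
    """Group (epoch_seconds, payload) events into five-minute batches."""
    batches = defaultdict(list)
    for ts, payload in events:
        window_start = ts - (ts % WINDOW_SECONDS)  # floor to window boundary
        batches[window_start].append(payload)
    return [(start, batches[start]) for start in sorted(batches)]


events = [(0, "a"), (120, "b"), (299, "c"), (300, "d"), (650, "e")]
result = micro_batches(events)

for start, payloads in result:
    print(start, payloads)
```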
Likewise, there are integration tools, such as change data capture software, that enable you to query a stream as if it were a batch, often using SQL. This makes data integration pretty straightforward, and I recommend you start here. Handling streams that are interrupted and then restarted remains a challenge. But your chosen tool may offer best practices or specific features that make this catch-up effort, sometimes called rehydrating the stream, easier.
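One common way to make catch-up after an interruption easier is to checkpoint the last successfully processed position and resume from it. The sketch below assumes each event carries a monotonically increasing offset and that the source can be re-read after a restart; whether your tool exposes offsets this way is an assumption, and `consume`, `flaky_handler` and the sample events are hypothetical.

```python
def consume(stream, checkpoint, handle):
    """Process events after the saved offset and advance the checkpoint."""
    for offset, event in stream:
        if offset <= checkpoint["offset"]:
            continue  # already processed before the interruption
        handle(event)
        checkpoint["offset"] = offset  # record progress only after success
    return checkpoint


stream = [(1, "created"), (2, "updated"), (3, "shipped"), (4, "delivered")]
checkpoint = {"offset": 0}
processed = []
attempted = set()


def flaky_handler(event):
    # Simulate an interruption the first time "shipped" is seen.
    if event == "shipped" and "shipped" not in attempted:
        attempted.add("shipped")
        raise ConnectionError("stream interrupted")
    processed.append(event)


try:
    consume(stream, checkpoint, flaky_handler)
except ConnectionError:
    pass  # the checkpoint still points at the last good offset

offset_after_failure = checkpoint["offset"]

consume(stream, checkpoint, flaky_handler)  # restart and catch up
print(processed, checkpoint)
```

Because the checkpoint advances only after an event is handled, the restart neither skips the failed event nor reprocesses the ones that already succeeded.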
Nevertheless, because you're using a workaround to integrate a stream, data observability is again essential to identify and respond to any issues that arise.
8. Mixed tool sets and architecture
To combat the previous challenges, I've recommended various tools and platforms. In doing so, I've now created another challenge: handling all these tools and the complex architecture that can result from using them.
The good news is that, except in the most complex scenarios, you won't be trying to solve every challenge listed. Also, if your systems are running on a cloud platform from one of the mega-vendors, versions of many of these tools will likely be available in some form. They might not always be best-in-class, but they're a good place to start -- the integration and consistency between them is helpful when starting out.
On the other hand, your choice of a data integration platform might be driven by your need for a very specific set of tools. For example, some data engineers -- particularly in manufacturing companies -- find that only one streaming tool really meets their needs. The platform that supports it best wins the day.
Whatever tools you select, governance should operate at a layer above them. One path forward may be metadata-driven automation: define governance policies as metadata that travels with the data, enforced automatically by whatever tool processes it. In this way, access controls, quality thresholds, retention policies, and lineage requirements are not tied to which tool happens to touch the data; they follow the data itself.
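A minimal sketch of that idea follows: the policy is stored as metadata on the data set itself, and every tool calls the same check before touching the rows. The `Policy`, `GovernedDataset` and `enforce` names, the roles and the thresholds are all hypothetical, and a real deployment would enforce this in a shared governance layer rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    allowed_roles: frozenset
    min_completeness: float
    retention_days: int


@dataclass
class GovernedDataset:
    name: str
    rows: list
    policy: Policy  # the policy travels with the data


def enforce(dataset, role, age_days, completeness):
    """Check the data set's own policy; any tool can call this before use."""
    p = dataset.policy
    if role not in p.allowed_roles:
        raise PermissionError(f"{role} may not read {dataset.name}")
    if age_days > p.retention_days:
        raise ValueError(f"{dataset.name} is past its retention period")
    if completeness < p.min_completeness:
        raise ValueError(f"{dataset.name} is below its quality threshold")
    return dataset.rows


customers = GovernedDataset(
    "crm.customers",
    [{"id": 1}],
    Policy(frozenset({"analyst"}), min_completeness=0.95, retention_days=365),
)


# Two different hypothetical tools, one shared rule set.
def batch_tool(ds):
    return enforce(ds, "analyst", age_days=30, completeness=0.99)


def streaming_tool(ds):
    return enforce(ds, "intern", age_days=30, completeness=0.99)


allowed_rows = batch_tool(customers)

try:
    streaming_tool(customers)
    denied = False
except PermissionError:
    denied = True

print(allowed_rows, denied)
```

Neither tool defines its own access rule; swapping one tool for another leaves the policy unchanged because it is attached to the data set.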
Best practices for managing data integration processes
There's a lot to consider with these challenges, and you'll likely come across more. The following are some best practices that can help make data integration strategies and efforts successful:
- Process data as close to the source as possible, both to minimize data movement and to remove or select out unneeded data for efficiency as soon as possible.
- Document integration processes and catalog the integrated data carefully, so business users and data scientists alike can find what they're looking for. As data integration grows in complexity, your documentation should grow in precision -- and volume, unfortunately. But look at that as an investment in future ease of use, data recovery and high availability, rather than as a burden.
- Treat lineage as a first-class requirement, especially when your data feeds into AI systems. Retrofitting lineage onto pipelines built without it is costly, so build it in from the start.
- Similarly, if your data is used to train or fine-tune AI models, maintain a strict versioning discipline. For both debugging models and regulatory audits, you should be able to reconstruct the exact data set used to train a model at a particular point in time.
- Keep your data integration processing nondestructive in analytical data sets. You never know when users will need to get back to the original data -- also, use cases vary greatly and change over time.
- With that in mind, you may find a data lakehouse architecture most appropriate, but still include a landing zone for raw data, a staging area for temporary data, and zones where you can save integrated, conformed and cleansed data to meet the requirements of data science, BI and AI use cases.
- In general, focus on data integration techniques rather than tools, especially at the design or prototyping stage. Integration tools are helpful, but they can constrain what you build around their own capabilities. In practice, many data engineers or integration developers address challenges such as these with a lot of code and a limited set of tools. That's not for everyone, of course, but a lot of example code and architectural advice is available from people who have solved similar integration problems already.
- Don't just rely on logging to track and monitor integration processes. Use data observability techniques so the availability and quality of data are documented and well understood by users and data administrators alike.
- Recognize that AI pipelines may fail differently from traditional ETL. A batch job either completes or throws an error, but a pipeline fed stale data, or data with emerging biases, may keep running silently while its results degrade. Monitoring AI integration points requires attention not only to technical execution, but also to data freshness and the quality of inferences or predictions.
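The versioning practice in the list above can be supported by recording a content fingerprint of each training data set alongside the model version. The sketch below is one possible approach under simple assumptions: rows are JSON-serializable dictionaries, row order does not matter, and the `registry` dictionary stands in for whatever model registry you use. A fingerprint does not store the data; it lets you verify later that a retained snapshot is exactly the set that was used.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a SHA-256 digest that changes whenever the row content changes."""
    ordered = sorted(rows, key=lambda r: json.dumps(r, sort_keys=True))
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


registry = {}  # hypothetical map of model version -> data set fingerprint

training_rows = [{"id": 1, "label": "churn"}, {"id": 2, "label": "stay"}]
registry["model-v1"] = fingerprint(training_rows)

# Same content in a different order yields the same fingerprint.
reordered = list(reversed(training_rows))
same = fingerprint(reordered) == registry["model-v1"]

# Any added, removed or altered row yields a different one.
changed_rows = training_rows + [{"id": 3, "label": "stay"}]
different = fingerprint(changed_rows) != registry["model-v1"]

print(same, different)
```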
Donald Farmer is a data strategist with 30+ years of experience, including as a product team leader at Microsoft and Qlik. He advises global clients on data, analytics, AI and innovation strategy, with expertise spanning from tech giants to startups. He lives in an experimental woodland home near Seattle.