Every few months, I meet a CIO who tells me their team has consolidated numerous legacy data sources into a single new platform and that its worst data integration challenges are now a thing of the past. Inevitably, a few weeks later, I hear a familiar story: it turns out that much of the organization's data didn't reside only in those particular source systems. There was also data on desktops, in file shares and streaming from various devices, plus external data collected from the web. The need for complicated data integration never did go away.
In fact, almost every organization of any size needs to integrate data from multiple sources to support business processes and get a consistent view, one that's as complete as possible, of the current state of the business, customer behavior, ongoing operations and more. It's not an easy task.
The first step in tackling data integration is to understand how the process fits into your overall data management strategy. For example, your ERP system might be governed and administered by a set of data management policies focused on financial integrity. Meanwhile, your CRM system is governed by the need to comply with customer data privacy regulations. What policies, then, will govern a process that integrates both systems and creates a new, consolidated data set for BI and analytics uses?
Data integration is a vital part of data management, but it's important to be aware of the possible implications before you dive into integrating data. Once you are, there are plenty of technical challenges to face. The eight described here are common ones, especially in modern data architectures that are increasingly diverse and dynamic. But I won't just raise eight problems -- I'll also give pointers on how to tackle them in principle.
1. Ever-increasing data volumes
Can you have too much of a good thing? Most businesses today recognize data as a valuable asset, but many struggle with the sheer scale of what is now available to them. Data storage is relatively cheap and analytics tools are very capable of handling large amounts of data, so where does the problem lie? Data integration, and related disciplines such as managing data quality, can be a real challenge when data volumes become massive.
For example, joining very large data sets to find matching values uses a lot of system resources and is often a slow process. Other operations, such as large data aggregation and sorting jobs, can also be resource-intensive. Temporary storage and other techniques you might use to help streamline things can be logistically challenging even when the costs aren't prohibitive.
In short, many of the data integration tasks performed routinely and without much thought at modest data volumes become far more taxing with very large workloads. But most of the techniques described in this article for handling complex integration processes will also help with large volumes of data. In addition, you can run smaller, more efficient batch integration jobs and optimize the integration workflow so individual operations don't block or hold up the pipeline.
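One way to keep a large join from exhausting system resources is to stream the big side through in batches while holding only the small side in memory. Here's a minimal sketch of that idea; the file names, column names and batch size are all hypothetical, and a real pipeline would more likely use a dedicated engine or a library such as pandas.

```python
# Sketch: joining a very large "orders" file to a smaller "customers"
# lookup table in batches, so the large file never has to fit in memory.
# File names and column names here are hypothetical.
import csv

def batched_join(orders_csv, customers_csv, out_csv, batch_size=100_000):
    # Small side: load once into an in-memory lookup keyed on the join column.
    with open(customers_csv, newline="") as f:
        customers = {r["customer_id"]: r["name"] for r in csv.DictReader(f)}
    with open(orders_csv, newline="") as fin, \
         open(out_csv, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames + ["name"])
        writer.writeheader()
        batch = []
        for row in reader:
            row["name"] = customers.get(row["customer_id"], "")
            batch.append(row)
            if len(batch) >= batch_size:   # flush each batch as it fills
                writer.writerows(batch)
                batch = []
        writer.writerows(batch)            # flush the final partial batch
```

The same shape, smaller batches flushed as they complete, is what keeps memory flat no matter how large the input grows.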
2. Diverse data sources
At one time, extract, transform and load (ETL) and other data integration processes mostly involved a mix of text files and database extracts that contained familiar structured data types. Now, you might also need to integrate streaming data from device logs and online services; social media data that includes text and images; public data, such as weather data or commodity prices downloaded from government websites or specialized information providers; and, in some sectors, increasing amounts of ecosystem data from your customers, partners, suppliers, shippers and so on.
This data diversity complicates integration work, but you can manage it with a careful choice of data platforms and tool sets. A traditional data warehouse built on a relational database can handle some of the diversity but is unlikely to have the right tools for integrating data streams or parsing information from images and sentiment from text. On the other hand, the combination of a well-designed data lake and Apache open source technologies integrated with your data pipelines might handle the streams and unstructured data, but it can't match a SQL database for ensuring data integrity and availability.
As a result, your answer is unlikely to be a one-size-fits-all solution. You need to understand in detail which elements of your diverse data assets are going to cause the most integration trouble and possibly process them separately, while handling the majority of your data with the most robust tools you can find.
3. Hybrid cloud and on-premises environments
It's easy to assume that nearly all large-scale IT workloads have moved to the cloud. They haven't. Although cloud systems are flexible and easy to scale up or down, they aren't necessarily cheap. Many enterprises have been surprised by the cost of cloud computing -- they find that the elastic nature of cloud services and the simplicity of scaling lead to inefficient practices that result in usage costs being far more than expected.
For this reason, some companies still run workloads in on-premises systems. Some have also never moved mission-critical workloads to the cloud, preferring to keep them close and manage them directly. Others have legacy systems that simply can't be hosted in the cloud.
In these cases, data engineers and data integration developers almost inevitably find themselves mixing cloud and on-premises data, and the question arises: Where to do the data integration?
In part, the answer might be imposed by policy or strategy. If you prefer to keep mission-critical data on premises, you'll typically download your cloud data to integrate with it there. But in many cases, you only need to integrate a subset of some data -- perhaps a partition of a data set defined by date, or a SELECT statement in SQL for a limited number of columns or rows in a database. If so, it may be possible to upload that data to the cloud for integration with data stored there. As a rule of thumb, though, you want to minimize data movement to reduce the opportunities for integration errors and failures.
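Extracting only the needed subset before moving anything is simple to sketch. The example below uses SQLite as a stand-in for an on-premises database; the table, columns and date filter are hypothetical, but the point carries over to any SQL source: select the partition and columns the cloud job needs, not the whole table.

```python
# Sketch: extract only a date partition and a few columns before moving
# data to the cloud. sqlite3 stands in for the on-premises database;
# the "orders" table and its columns are hypothetical.
import csv
import sqlite3

def extract_subset(db_path, out_csv, since_date):
    conn = sqlite3.connect(db_path)
    try:
        # Ship only what the cloud-side integration actually needs.
        cur = conn.execute(
            "SELECT order_id, customer_id, total "
            "FROM orders WHERE order_date >= ?", (since_date,))
        with open(out_csv, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([c[0] for c in cur.description])
            writer.writerows(cur)
    finally:
        conn.close()
```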
4. Poor or inconsistent data quality
I often say there's only one measurement of data quality: Is it fit for purpose? But the same data can be used for very different purposes, which makes assessing its quality in that way tricky. Also, key dimensions of data quality such as accuracy, timeliness, completeness and -- especially -- consistency do become more difficult to ensure as data volumes grow and data comes from diverse sources.
That could mean business decisions are based on bad, incomplete or duplicate data. To prevent that, data quality issues must be identified and addressed during data integration, through steps such as data profiling and data cleansing. In a data lake, though, my advice is to not run destructive data integration processes that overwrite or discard the original data, which may be of analytical value to data scientists and other users as is. Rather, ensure the raw data is still available in a separate zone of the data lake.
5. Multiple use cases
It's not only data quality that is affected by multiple use cases -- data integration is, too. Data scientists often want to build analytical models with fine-grained raw data so they don't miss potential insights, but business analysts more often want to work with aggregated data conformed to their best practices. Similarly, data may need to be in different formats to be consumed by different tools. For example, data science tools might optimally use Apache Parquet files, but BI dashboards need to be built through a database connection.
The secret is not only to understand these different use cases, but also to accommodate them. Don't try to force users to change data platforms or analytics tools or to use a suboptimal connection just to make data integration easier. Instead, look for a data integration platform that supports a wide variety of targets and don't be reluctant to build separate data pipelines for specific use cases.
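Serving two use cases from one integrated data set can be as simple as landing it in two targets. In this sketch, a SQLite table stands in for the BI database connection and a JSON-lines file stands in for the file-based format; in practice the file side would likely be Parquet written with a library such as pyarrow. The table name and row shape are hypothetical.

```python
# Sketch: publish one integrated data set to two targets -- a database
# table for BI dashboards and a file for data science tools. sqlite3 and
# JSON lines are stdlib stand-ins; real pipelines would likely use a
# warehouse connection and Parquet.
import json
import sqlite3

def publish(rows, db_path, file_path):
    # Target 1: database table reachable over a BI tool's connection.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, total REAL)")
    conn.executemany("INSERT INTO sales VALUES (:region, :total)", rows)
    conn.commit()
    conn.close()
    # Target 2: file-based copy for notebooks and data science tools.
    with open(file_path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```

The design choice worth noting is that both targets are written by the same pipeline run, so the two audiences never see divergent versions of the data.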
6. Monitoring and observability
Observability is a newish IT term. In a data integration context, it's easy to mistake it for a more traditional approach that involves careful logging and reporting of integration processes. However, there's more to data observability than that.
Modern data integration processes -- diverse, distributed and dynamic as they are -- also have numerous points of failure. Very often, data pipelines and the dependencies between them are designed so no single failure breaks the whole process. This robustness is admirable and saves a lot of downtime, which used to be the bane of overnight ETL processes that broke too easily and were difficult to restart.
However, if the process as a whole doesn't break but one part of it fails, what is the current state of your data? Monitoring that can be a challenge.
This is where data observability comes in. It measures data delivery (including when the data was last fully processed), logs all processes and traces both the source and impact of any errors. When you run a report, you can see not only when the data was refreshed, but also which components -- even which rows or cells -- may not be fully up to date or of the best quality. Observability tends to focus on five key attributes of data.
- Is it timely?
- Is it structured as we expect?
- Is it within the expected data quality limits?
- Is the data complete?
- What is the lineage of the data -- where was it sourced from?
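The five attributes above translate naturally into automated checks on each batch of data. Here's a minimal sketch of such a check; the expected fields, value limits and freshness threshold are all hypothetical, and a production setup would more likely use a dedicated observability or data quality tool.

```python
# Sketch: one observability check per attribute, run against a batch of
# records. The schema, limits and threshold below are hypothetical.
from datetime import datetime, timedelta, timezone

EXPECTED_FIELDS = {"id", "value", "loaded_at", "source"}

def observe(batch, max_age_hours=24):
    now = datetime.now(timezone.utc)
    return {
        # 1. Timeliness: was anything loaded recently enough?
        "timely": any(
            now - r["loaded_at"] <= timedelta(hours=max_age_hours)
            for r in batch),
        # 2. Structure: does every record carry the expected fields?
        "well_structured": all(set(r) == EXPECTED_FIELDS for r in batch),
        # 3. Quality: are values within the expected limits?
        "within_limits": all(
            r["value"] is not None and 0 <= r["value"] <= 1_000_000
            for r in batch),
        # 4. Completeness: no missing values?
        "complete": all(r["value"] is not None for r in batch),
        # 5. Lineage: do we know where each record came from?
        "lineage_known": all(r.get("source") for r in batch),
    }
```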
In a perfect world, all our data would be right all the time. In the real world, pay attention to data observability and perhaps even invest in specific tools for this work.
7. Handling streaming data
A stream of data is continuous and unbounded; there's no beginning or end, just a sequence of recorded events. And therein lies the challenge: How do you integrate this open-ended data with more traditional architectures that expect data sets, not streams?
With many devices or streaming services, it's possible to request data in a batch. You could get all the data for the last five minutes, for example. What you receive is a file, structured much like an XML record or a database table. You could then get another one five minutes later and so on, building up a continuous flow of data. This isn't really streaming -- in fact, many call this micro-batching because you're receiving numerous small, discrete batches or data sets.
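The micro-batching loop described above can be sketched in a few lines. The `fetch_since` and `process_batch` callables here are hypothetical placeholders for whatever your source API and batch integration step actually are.

```python
# Sketch: micro-batching. Events are pulled from a (hypothetical) source
# in fixed windows, and each window is handed to an ordinary batch-style
# integration step, so downstream tools see discrete data sets.
import time

def micro_batch(fetch_since, process_batch, window_seconds=300, cycles=None):
    """fetch_since(ts) returns events recorded after timestamp ts;
    process_batch(events) is a normal batch integration step.
    cycles=None runs forever; a number is handy for testing."""
    last = time.time() - window_seconds
    n = 0
    while cycles is None or n < cycles:
        now = time.time()
        events = fetch_since(last)   # one discrete, file-like batch
        if events:
            process_batch(events)
        last = now
        n += 1
        if cycles is None or n < cycles:
            time.sleep(window_seconds)   # wait for the next window
```

Note that `last` is advanced whether or not the batch succeeds; a real pipeline would only advance it after `process_batch` completes, which is exactly the restart problem discussed next.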
Likewise, there are integration tools, such as change data capture software, that enable you to query a stream as if it were a batch, often using SQL. This makes data integration pretty straightforward, and it's where I recommend you start. There are still problems with how you handle streams that are interrupted and restarted, but you might find the tool you've chosen at least has some best practices, if not specific features, that make this catch-up effort -- sometimes called rehydrating the stream -- easier.
Nevertheless, because you're using a workaround to integrate a stream, data observability again is essential to ensure you can identify and respond to any issues that arise.
8. Mixed tool sets and architecture
For the previous challenges, I've made some recommendations about specific types of tools or platforms that can be used to tackle specific data integration problems. In doing so, I've now created another challenge for you: handling all these tools and the complex architecture that can result from using all of them.
The good news is that, except in the most complex scenarios, you won't be trying to solve every challenge listed. Also, if your systems are running on a cloud platform from one of the mega-vendors, versions of many of these tools likely will be available in some form. They might not always be best-in-class, but they're a good place to start -- the integration and consistency between them is helpful when starting out.
On the other hand, your choice of a data integration platform might be driven by your need for a very specific set of tools. For example, some data engineers -- particularly in manufacturing companies -- find that only one streaming tool really meets their needs. The platform that supports it best wins the day.
Best practices for managing data integration processes
There's a lot to consider in these challenges, and you'll likely come across more. But here are some best practices that can help ensure data integration strategies and efforts are successful.
- Process data as close to the source as possible, both to minimize data movement and to filter out unneeded data as early as possible for efficiency.
- Document integration processes and catalog the integrated data carefully, so business users and data scientists alike can find what they're looking for. As data integration grows in complexity, your documentation should grow in precision -- and volume, unfortunately. But look at that as an investment in future ease of use, data recovery and high availability, rather than as a burden.
- Ensure your data integration processing is nondestructive in analytical data sets. You never know when users will need to get back to the original data -- also, use cases vary greatly and change over time.
- With that in mind, structure data lakes to include a landing zone for raw data, a staging area for temporary data and zones where you can save integrated, conformed and cleansed data to meet the requirements of data science and BI applications.
- In general, focus on data integration techniques rather than tools, especially at the design or prototyping stage. Integration tools are helpful, but they can also limit use cases to their own specialized focus. In practice, many data engineers or integration developers address challenges like these with a lot of code and a limited set of tools. That's not for everyone, of course, but a lot of example code and architectural advice is available from people who have solved similar integration problems already.
- Don't just rely on logging to track and monitor integration processes. Use data observability techniques to ensure the availability and quality of data are documented and well understood by users and data administrators alike.
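The zone structure recommended in the best practices above can be expressed as simple directory conventions. The zone names in this sketch are one common convention, not a standard, and the `promote` step is a hypothetical stand-in for a real cleansing job; the point is that promotion copies data forward and never touches the landing zone.

```python
# Sketch: a data lake laid out as landing / staging / curated zones.
# Zone names are a common convention, not a standard. Promotion is
# nondestructive: the landing copy is read, never modified.
from pathlib import Path

ZONES = ("landing", "staging", "curated")

def init_lake(root):
    root = Path(root)
    for zone in ZONES:
        (root / zone).mkdir(parents=True, exist_ok=True)
    return root

def promote(root, name, cleanse):
    """Read raw data from landing and write a cleansed copy to curated,
    leaving the original file in place for future use cases."""
    raw = (Path(root) / "landing" / name).read_text()
    (Path(root) / "curated" / name).write_text(cleanse(raw))
```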