The tsunami of data is continually growing. Organizations interact more with their customers, prospects, partners, suppliers and other stakeholders than ever before. They are more apt to get data from social media, IoT and other new data sources. Most organizations have no problem creating and collecting data, but managing it effectively is another story. In many cases, data silos contain inconsistent data that saps the productivity of business users -- a big blow to their ability to make informed decisions.
The key ingredient that these organizations need is data integration. Unlike in the past, when data integration typically focused on loading an enterprise data warehouse, it often now includes a logical data warehouse environment that encompasses that traditional EDW along with data lakes, analytical sandboxes and data science hubs. And it's not just about extracting and loading structured data into relational databases; it's working with a variety of data structures and databases.
Data integration involves taking data -- often from multiple sources -- and transforming it into meaningful information for business executives, data analysts, data scientists and others. As the need to share the growing volumes and varieties of data increases, turning to commercial data integration platforms is one way to help manage and simplify the process.
What are data integration platforms?
Packaged data integration software began with extract, transform and load (ETL) tools designed to automate efforts to pull data from source systems, convert it into a consistent format and load it into an EDW or other target database. The first generation of ETL tools was simple but expensive code generators with limited functionality. Many of the companies that evaluated these tools found it more effective to develop their own custom integration code. Second-generation ETL software offered more functionality, but was primarily batch-oriented and didn't perform well. Based on those two sets of tools, many IT managers felt that ETL software wasn't worth the cost or the effort to learn, as it wouldn't meet their performance needs.
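The ETL pattern those tools automate can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation; the table names, columns and sample data are all hypothetical, with in-memory SQLite databases standing in for a source system and a warehouse.

```python
import sqlite3

def extract(source_conn):
    # Extract: read raw customer rows from the source system
    return source_conn.execute("SELECT id, name, country FROM customers").fetchall()

def transform(rows):
    # Transform: apply a consistent format (trimmed names, upper-case country codes)
    return [(rid, name.strip().title(), country.strip().upper())
            for rid, name, country in rows]

def load(target_conn, rows):
    # Load: write the cleaned rows into the warehouse table
    target_conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)
    target_conn.commit()

# Demo with in-memory databases standing in for real systems
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
source.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                   [(1, "  alice smith ", "us"), (2, "BOB JONES", " de ")])

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE dim_customer (id INTEGER, name TEXT, country TEXT)")

load(target, transform(extract(source)))
print(target.execute("SELECT * FROM dim_customer").fetchall())
# → [(1, 'Alice Smith', 'US'), (2, 'Bob Jones', 'DE')]
```

Early tools generated roughly this kind of code; the later platforms described below wrap the same pattern in error handling, scheduling and restart logic.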
But over the years, ETL tools have evolved in several key areas, including development, operational processing and integration functionality. To make them a more viable development platform, ETL vendors added support for code management, version control, debugging and documentation generation. For operational processing, the tools now have built-in functionality such as error handling, recovery and restart, runtime statistics and scheduling. Data integration tools can also gather, transform and load a mix of structured, semi-structured and unstructured data.
As the industry gained experience and sophistication in data integration, a set of best practices was built into ETL tools as prebuilt transformations. These transformations include mechanisms for change data capture, slowly changing dimensions, hierarchy management, data connectivity, data merging, reference lookups and referential integrity checks. Data integration performance has also increased significantly through the use of memory, parallelism and various data transport architectures.
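To see what one of these prebuilt transformations does, consider slowly changing dimensions. A common variant, Type 2 in Kimball's classification, preserves history: when an attribute changes, the current row is closed out and a new row is appended. The sketch below is illustrative only; the record structure and field names are assumptions, not any tool's actual implementation.

```python
from datetime import date

def scd2_update(dimension, key, new_attrs, as_of):
    # Type 2 slowly changing dimension: never overwrite, keep every version
    for row in dimension:
        if row["key"] == key and row["current"]:
            if row["attrs"] == new_attrs:
                return  # no change detected, nothing to do
            row["current"] = False   # close out the old version
            row["end_date"] = as_of
            break
    dimension.append({"key": key, "attrs": new_attrs,
                      "start_date": as_of, "end_date": None, "current": True})

# One current row for customer 42, who then moves from Boston to Chicago
dim = [{"key": 42, "attrs": {"city": "Boston"},
        "start_date": date(2020, 1, 1), "end_date": None, "current": True}]

scd2_update(dim, 42, {"city": "Chicago"}, date(2023, 6, 1))
print(len(dim), dim[1]["attrs"]["city"])  # → 2 Chicago
```

Hand coding this logic correctly for every dimension table is tedious, which is why its inclusion as a prebuilt transformation was a meaningful step forward.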
In addition, a variant of ETL tools emerged called extract, load and transform (ELT). These tools eliminate the need for a separate application server dedicated to ETL and can be deployed at either the data sources or target systems based on their capacity and configurations. The ELT approach lets users store raw data as-is and then transform all or subsets of it as needed for specific business intelligence and analytics applications.
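The ELT difference is mostly about where the transformation runs. In the sketch below, raw records land in the target database untouched, and the transform step is a SQL statement executed inside that database rather than on a separate ETL server. The schema and data are hypothetical, with SQLite standing in for the target system.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Load: land the raw records as-is, with no cleanup on the way in
db.execute("CREATE TABLE raw_orders (order_id INTEGER, amount TEXT, region TEXT)")
db.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
               [(1, "10.50", " east"), (2, "4.25", "West "), (3, "7.00", "east")])

# Transform: derive an analytics-ready subset on demand, inside the database
db.execute("""
    CREATE TABLE sales_by_region AS
    SELECT LOWER(TRIM(region)) AS region, SUM(CAST(amount AS REAL)) AS total
    FROM raw_orders
    GROUP BY LOWER(TRIM(region))
""")

print(db.execute("SELECT region, total FROM sales_by_region ORDER BY region").fetchall())
# → [('east', 17.5), ('west', 4.25)]
```

Because `raw_orders` is kept intact, a different team could later derive a different subset from the same raw data, which is the flexibility the ELT approach is after.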
ETL tools evolve into data integration platforms
Data integration used to focus on core ETL functions for loading EDWs, data marts and BI data stores such as online analytical processing cubes and columnar databases. However, the needs of data integration have now expanded to include the following tasks:
- big data integration
- B2B integration
- cloud integration
- application and business process integration
- data migration
- data consolidation
- data quality and cleansing
- master data management
As that happened, the following integration categories emerged, targeting specific uses and technologies:
Enterprise application integration (EAI). Often referred to simply as application integration, this subcategory, which supports interoperability among different applications, is enabled through web or data services created using service-oriented architecture and industry standards such as electronic data interchange. An enterprise service bus is a common architectural approach to implementing EAI functionality.
Big data integration. This technology focuses on loading data into NoSQL databases and Hadoop, Spark and other big data platforms. Each category of NoSQL database -- wide column, key value, graph and document -- has different integration interfaces and use cases that need to be accommodated by integration tools. With Hadoop data integration, processes typically interface with various Hadoop distribution components such as Spark, MapReduce, Hadoop Distributed File System, HBase, Hive, Pig and Sqoop. Processing engines like Spark are also increasingly used apart from Hadoop, with corresponding integration needs.
Enterprise messaging system (EMS). This technology focuses solely on providing messaging among disparate applications using structured formats such as XML and JSON. EMS tools offer a lightweight integration service that can effectively provide real-time data updates from different data sources.
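The decoupling an EMS provides can be shown with a toy in-process queue. This is a conceptual sketch only, not a real messaging product: the topic name and payload are invented, and `queue.Queue` stands in for a message broker.

```python
import json
import queue

bus = queue.Queue()

def publish(topic, payload):
    # Serialize the event as structured JSON, as an EMS would with XML or JSON
    bus.put(json.dumps({"topic": topic, "payload": payload}))

def consume():
    # The consumer handles messages without knowing which application sent them
    message = json.loads(bus.get())
    return message["topic"], message["payload"]

publish("customer.updated", {"id": 7, "email": "a@example.com"})
topic, payload = consume()
print(topic, payload["id"])  # → customer.updated 7
```

The producer and consumer share only the message format, which is what lets an EMS push real-time updates between applications that were never designed to talk to each other.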
Enterprise information integration. EII -- initially known as data federation -- provided a virtual view of disparate data sources, but had limited integration capabilities. The current generation, called data virtualization software, provides both data abstraction and data services layers to a wide variety of sources, including structured, semi-structured and unstructured data.
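The core idea behind data virtualization is that a query is resolved against the live sources at request time, with nothing copied into a warehouse first. The sketch below is purely illustrative; the two dictionaries stand in for a relational CRM database and a billing service, and all names are assumptions.

```python
# Two live sources standing in for real systems
crm_db = {101: {"name": "Acme"}, 102: {"name": "Globex"}}        # relational source
billing_api = {101: {"balance": 250.0}, 102: {"balance": 0.0}}   # service source

def virtual_customer_view(customer_id):
    # Federate both sources on demand; nothing is persisted or staged
    record = {"id": customer_id}
    record.update(crm_db.get(customer_id, {}))
    record.update(billing_api.get(customer_id, {}))
    return record

print(virtual_customer_view(101))
# → {'id': 101, 'name': 'Acme', 'balance': 250.0}
```

A real data virtualization layer adds query optimization, caching and security on top, but the abstraction is the same: consumers see one logical record regardless of where the pieces live.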
Cloud-based integration. Also referred to as integration platform as a service, cloud-based integration emerged to provide real-time interoperability between cloud-based applications and databases. These tools deploy as a cloud service, offering EAI and EMS functionality.
Eventually, vendors put the various pieces together and began offering full-fledged data integration suites that provide hybrid capabilities spanning ETL, application integration, cloud-based integration, real-time integration and data virtualization, as well as data cleansing and data profiling tools. The suites can support data integration processes in traditional batch mode or in real or near real time through the use of web services. They can also handle both on-premises and cloud data and less structured information -- system logs, text and other forms of big data, for example -- along with structured transaction data.
Dispelling data integration tool myths
With correct use, data integration platforms greatly improve user productivity and integration flexibility, scalability and expandability over custom manual coding (see sidebar, "Tool-based data integration development vs. hand coding"). However, hand coding is still widespread, whether it's IT workers writing SQL scripts or businesspeople using spreadsheets. There are several reasons why IT groups believe they should manually write code rather than use a data integration platform, but those beliefs are usually based on the following misconceptions:
Integration tools are too expensive. There's a lingering perception left over from the early days of ETL that expensive tools are the only choice, but many data integration platforms priced for cost-sensitive budgets are now available.
Highly skilled resources are required. Another false perception is that an enterprise looking to use commercial software needs data integration developers experienced in legacy ETL tools. Those older tools did require extensive skills, but the newer data integration platforms are much easier to use.
Coding is cost-free. There's an inherent bias for the IT staff to generate SQL code; they know SQL and can create code in it quickly, and there are no license or subscription costs. However, what starts out as a simple SQL script can quickly snowball into numerous scripts or stored procedures, creating a hodgepodge of often-undocumented integration processes. Making changes to that code takes longer as it gets more complex, consuming more and more resources just to maintain it.
Tool-based data integration development vs. hand coding
Tool-based data integration development provides the following benefits:
- reusable processes based on industry best practices;
- comprehensive data quality processes;
- workflow, error handling and restart and recovery functionality;
- self-documentation of processes and entire workflow;
- the enablement of data governance; and
- impact analysis and where-used (lineage) functionality.
The data integration platform market
There are a variety of data integration platforms available, but IBM, Informatica, Talend, Oracle, SAP, SAS Institute Inc. and Information Builders lead the market. Several of the traditional application integration tools have broadened their capabilities so they overlap with data integration tools. In addition, vendors now offer a category of data integration tools called data preparation tools, which are geared toward data analysts and data scientists.
All of these vendors sell data integration products that deploy on premises, but will integrate data that resides on premises or in the cloud. Also, both Talend and Hitachi Vantara's Pentaho platform offer open source versions of their products along with paid-for enterprise versions. Microsoft is unique in that it bundles its data integration product with its databases rather than selling it separately.
Data integration continues to be a primarily IT-centric activity, given the data, database and technology expertise it requires. Typically, IT groups responsible for BI and data warehouse systems also manage data integration, along with data quality, master data management and other data management programs. These groups should have the skills and experience to successfully use the integration platforms. Some leading-edge enterprises, with multiple integration use cases and separate IT groups addressing those uses, have created integration competency centers to manage their data integration platforms from an enterprise-wide perspective in an effort to avoid data silos.