Integrating data from one location with another has long been a primary challenge of data management.
Data typically resides in multiple locations and formats, and bringing it all together in a form that organizations can use is difficult. Approaches to enterprise data integration include data virtualization, in which data stays in its original location, and data loading, in which data is moved into a centralized location.
A standout among vendors in the data integration market is Talend, based in Redwood City, Calif. Talend has grown its suite of tools in recent years. It acquired Stitch Inc. in 2018, bringing in a new data loader tool to the company's portfolio to complement its own extract, transform and load platform offerings. The vendor has continued to develop the Stitch technology since the acquisition and is pushing forward on broader efforts to enable data trust as well.
In this Q&A, Laurent Bride, CTO and COO of Talend, outlines the evolution of the company in recent years and provides insight into the state of data integration today.
What's the difference between data loading with Stitch and Talend's other data integration tools?
Laurent Bride: Stitch is really focused on quick data ingestion. So it's more of an extract and load into a data store and it doesn't do advanced transformation. It doesn't do the governance of data or apply any kind of data quality rules. For data transformation, you would use the other piece of the Talend portfolio. But when it comes to very quickly landing data from various sources into your analytical storage and data engine, Stitch does that tremendously quickly and effectively.
Why should organizations take a data loading approach, instead of data virtualization?
Bride: There are vendors out there that do pure data virtualization; we are not one of them.
We used to have a trend in the market where people were saying, 'OK, you're not going to move the data, it stays where it is. Then you're going to create a federation layer, where you can run your queries and never leave the actual data sources.'
What we've seen in our market is that customers actually bring all their data, because storage is so cheap, and right now all the cloud vendors are providing very powerful solutions.
For organizations that are already bringing all their data into a data lake and need to transform the data, there is the concept of the Spark pipeline. With Talend you can build Spark pipelines that you would deploy where your data resides in the data lake, and that's where you do all your transformation. We see more and more customers are adopting that kind of pattern.
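The pattern Bride describes, landing data untouched and then transforming it where it sits, can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for the data lake and SQL as a stand-in for a Spark pipeline, and it does not use Stitch, Talend or Spark. The table names, sample rows and function names are hypothetical.

```python
import sqlite3

# Hypothetical source records; every field arrives as untyped text.
SOURCE_ROWS = [
    ("1001", " Alice ", "42.50"),
    ("1002", "bob", "17.00"),
    ("1003", "Carol", "not-a-number"),
]


def load_raw(conn, rows):
    """Extract-and-load step: land rows as-is, with no cleaning or typing."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)


def transform_in_place(conn):
    """Transform step: runs inside the store, next to the landed data."""
    conn.execute(
        """
        CREATE TABLE clean_orders AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               LOWER(TRIM(customer))     AS customer,
               CAST(amount AS REAL)      AS amount
        FROM raw_orders
        WHERE amount GLOB '[0-9]*.[0-9]*'  -- keep only rows with a numeric amount
        """
    )


def run_pipeline():
    """Load the raw rows, transform them in the store, return the clean rows."""
    conn = sqlite3.connect(":memory:")
    load_raw(conn, SOURCE_ROWS)
    transform_in_place(conn)
    return conn.execute(
        "SELECT order_id, customer, amount FROM clean_orders ORDER BY order_id"
    ).fetchall()


if __name__ == "__main__":
    for row in run_pipeline():
        print(row)
```

The raw table is kept alongside the cleaned one, so the transformation can be rerun or changed later without going back to the original sources.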
What has changed at Talend over the six years you've been with the company?
Bride: Well, when I joined Talend in 2014, it was the beginning of the big data era and we already had a MapReduce code generator. We made the decision back then to deliver a new code generator. If you recall those days, there was a new data processing framework coming up every other month. We picked Spark.
We got lucky and Spark really dominates the data processing world when it comes to data lake scenarios. I've also seen Spark getting more into the structured world, whereas the initial Spark release was more about unstructured data processing.
Cloud has also been a big revolution. If you had asked me six years ago, 'Will public cloud dominate big data in 2020?' I would probably have said no, that a big chunk of the business would still be private cloud, the fully managed service kind of thing. But clearly public cloud has taken the world of big data by storm.
What do you see as the biggest challenges organizations face with data integration?
Bride: I think one of the recurring challenges that we have is around data quality. Data quality is a very difficult topic. It was difficult when customers were dealing with terabytes of data and it is even more difficult now that customers are dealing with petabytes of data.
Data quality is something that organizations struggle with. To trust the data in whatever analytics tools they are looking at, they need to be able to understand where the data came from and what kind of transformations happened to the data along the way.
We also hear a lot of questions from our customers about how to make data an asset and how to expose or share that data through APIs with ecosystem partners, so it's not just data within the company. We see that more and more, because the ecosystem is more connected than ever.
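The point about knowing where data came from and what transformations it went through is usually called data lineage. A minimal sketch of the idea, assuming a hypothetical `TrackedDataset` wrapper and a made-up source name, might look like the following. It is not how Talend or any particular product records lineage.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedDataset:
    """Rows plus a record of their origin and every step applied to them."""

    rows: list
    source: str  # hypothetical origin label, e.g. a file or table name
    steps: list = field(default_factory=list)

    def apply(self, name, fn):
        """Return a new dataset with fn applied to the rows and the step recorded."""
        return TrackedDataset(
            rows=fn(self.rows),
            source=self.source,
            steps=self.steps + [name],
        )

    def lineage(self):
        """Describe the path from the source through each transformation."""
        return " -> ".join([self.source] + self.steps)


def example():
    """Build a small dataset and clean it in two recorded steps."""
    raw = TrackedDataset(rows=[" Alice ", "bob", ""], source="crm_export.csv")
    stripped = raw.apply("strip_whitespace", lambda rs: [r.strip() for r in rs])
    return stripped.apply("drop_empty", lambda rs: [r for r in rs if r])


if __name__ == "__main__":
    cleaned = example()
    print(cleaned.rows)
    print(cleaned.lineage())
```

Because each step returns a new object, the earlier versions of the dataset and their shorter lineage stay available for comparison.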
Editor's note: This interview has been edited for conciseness and clarity.