Datafold raises $20M for data reliability engineering
Datafold's founder and CEO details the data observability challenges the startup is looking to address with its suite of data tools that provide visibility into data pipelines.
Data reliability engineering startup Datafold said on Nov. 9 it raised $20 million in a Series A round of funding to help the vendor build out its technology and go-to-market efforts.
Based in San Francisco, Datafold was founded in 2020 with the goal of expanding visibility into data pipelines to help organizations improve data quality and reliability.
The company's founder and CEO, Gleb Mezhanskiy, has worked in data engineering roles, including a stint at ride-sharing provider Lyft, where he realized that data reliability tooling needed to be more closely integrated with the development and operations architectures used at modern organizations.
Datafold's data reliability platform includes data catalog, data lineage and monitoring capabilities alongside a Data Diff tool for regression testing of data pipelines.
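To make the idea of regression testing a data pipeline concrete, here is a minimal, hypothetical sketch; it is not Datafold's implementation, and the table and column names are invented for illustration. It compares the table produced by the current pipeline code against the table produced by a proposed change, reporting rows that were added, removed, or modified, keyed by a primary key assumed to be the first column.

```python
import sqlite3

def diff_tables(conn, old, new, key="id"):
    """Return primary keys added or removed, and rows modified, between two tables."""
    cur = conn.cursor()
    added = cur.execute(
        f"SELECT {key} FROM {new} EXCEPT SELECT {key} FROM {old}"
    ).fetchall()
    removed = cur.execute(
        f"SELECT {key} FROM {old} EXCEPT SELECT {key} FROM {new}"
    ).fetchall()
    changed = cur.execute(
        f"SELECT * FROM {new} EXCEPT SELECT * FROM {old}"
    ).fetchall()
    added_keys = {k for (k,) in added}
    # Rows in `changed` whose key already existed in `old` were modified, not added.
    # Assumes the key is the first column of both tables.
    modified = [row for row in changed if row[0] not in added_keys]
    return {
        "added": added_keys,
        "removed": {k for (k,) in removed},
        "modified": modified,
    }

# Example: the proposed change drops one row and alters a value.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod(id INTEGER, revenue REAL);
    INSERT INTO prod VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    CREATE TABLE staging(id INTEGER, revenue REAL);
    INSERT INTO staging VALUES (1, 10.0), (2, 25.0);
""")
result = diff_tables(conn, "prod", "staging")
# Row 3 was removed and row 2's revenue changed, so the diff flags both
# before the change ever reaches production.
```

In a real workflow, a failing diff like this would block the code change in review rather than surface as a broken dashboard the next morning.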
In this Q&A, Mezhanskiy explains what data reliability engineering is and why the vendor is growing.
Why are you now raising a Series A?
Gleb Mezhanskiy: Over the past few years, we have worked with a few select early adopters. Some of them were really large companies with large data teams, and some were smaller companies in very particular domains, for example, health tech or fintech. We started seeing consistent patterns in how the product was used.
We believe that probably about 80% of what our users are doing day to day with their data workflows should be automated, and we can play a big part in that. So the fundraising was done with a view to expanding our products, building more automation to help our users be more productive, and becoming more sophisticated in how we help them detect and triage data issues.
What data reliability engineering challenge led you to start Datafold?
Mezhanskiy: In my experience, one theme that has consistently been a bottleneck is not just how to build a data pipeline or a dashboard, but how to make sure that the insights we deliver are actually reliable.
Organizations are dealing with larger volumes and varieties of data, and it has become increasingly hard to maintain data reliability and quality.
At Lyft, I was a data engineer on call and responsible for making sure that all the calculations that needed to happen overnight went smoothly. One night, I had to make a very small fix to some code that was processing data. I changed about four lines of SQL code, following the company's process for code review. The next day, when I came to work, everything was broken and the data dashboards were looking really weird.
Luckily for me, I wasn't fired on the spot and actually, they put me in charge of building tools to prevent the exact mistakes that I made from happening again. So we built lots of very powerful tooling to test data, detect anomalies and help developers and data engineers at Lyft build faster. But then I realized that this kind of tooling that I built internally would eventually be needed by any data team that is building pipelines and dashboards.
The idea behind starting Datafold is to equip every data team out there with tooling that allows them to move fast with high confidence and deliver reliable data products.
How is data reliability engineering different from data observability?
Mezhanskiy: Alerting has been the focus of many data observability vendors, and it is about detecting issues that have already happened in production. That is useful, because if there is a fire you want to put it out, but the challenge is that by the time you detect an issue, the damage is likely already done.
So when we thought about observability, we know that data teams need to know how the data flows throughout their pipelines, and they need to know what anomalies are happening.
Our largest focus is on answering the question: How can we help teams not have data quality issues in production in the first place? How can we detect things before they get to production?
We want to position ourselves as a platform that supports the practice of data reliability engineering.
What we want to do with our customers is find the ways in which data teams should evolve their data reliability practices, just as site reliability engineering evolved in software. It's not just about providing data teams with tools, but also about helping them implement better processes and a better culture within the data team.
Editor's note: This interview has been edited for clarity and conciseness.