When something goes wrong with a data pipeline, it's often difficult for organizations to figure out what happened and how to fix it.
That's the challenge that Monte Carlo, based in San Francisco, is looking to solve with its Incident IQ technology, which the vendor added to its Data Observability Platform on July 14.
The new Incident IQ capability helps organizations identify data pipeline problems such as outages or latency and provides insight that can help remediate the root cause. The Monte Carlo Data Observability Platform works with data from a variety of sources including data lakes, data warehouses and event streaming.
Among the data observability startup's users is online consignment and thrift shop ThredUp, based in Oakland, Calif.
Satish Rane, head of data engineering at ThredUp, said the firm uses data from multiple sources including transactional and behavioral data from consumers, as well as data that comes from the company's operations.
ThredUp also relies on streaming data via Apache Kafka. He said that when he joined ThredUp a year ago, he noticed that a number of different groups at the company, beyond just the data team, were introducing data.
"We had a decentralized approach to onboarding data," Rane said. "There are sacred things which the data team owns, which are critical for the finance side, and there are all these other on-boarders of data that probably do not go through the same regimen of what the data engineering team goes through."
The challenge for Rane and his team was understanding the source of data, as well as its structure and quality. There were also a lot of data pipelines that the company was running where it wasn't entirely clear where they were being used, whether it was for analytics, business intelligence or operations within the company.
"With Monte Carlo, first of all what it did was really give us a pulse of our data, whether the data made sense and if something was not right," Rane said.
Monte Carlo Incident IQ boosting data observability
ThredUp has been testing the Incident IQ feature in the Monte Carlo Data Observability Platform, and for Rane it has been a positive experience so far.
"From the data engineering side, you look at the incident, and then all in one place you are able to see everything, like upstream and downstream dependencies and what was the root cause, right down to the piece of code," Rane said.
Lior Gavish, CTO of Monte Carlo, explained that typically when a data pipeline is broken, it requires a certain amount of time to find the problem and then even more time to figure out how to fix it.
Gavish said that Monte Carlo's platform had previously provided visibility into the health of data pipelines. With Incident IQ, Monte Carlo is going beyond spotting problems to providing more insight to help users quickly fix problems.
How Incident IQ works to improve data observability
The Monte Carlo platform collects data about all the data sources and pipelines an organization is using.
The system captures event logs of what queries ran as well as what data transformation and ETL (extract, transform, load) processes have been executed. Monte Carlo also collects statistics about all the data to help provide a full view of what data is being used.
Broadly, three factors can create a data incident, according to Gavish. The first is an unexpected change in the data source itself, such as a change in formatting or content. The second is an error in the code used to configure the data pipeline, such as incorrect code formatting. The third is a change in the operational environment, such as a change in network configuration or user permissions.
With its insight into all the sources of data, combined with machine learning and anomaly detection, Gavish said Monte Carlo is able to help organizations pinpoint data pipeline incidents.
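Monte Carlo has not published the details of its detection models, so the following is only a minimal sketch of the general idea of anomaly detection on pipeline statistics, not the vendor's method. It assumes a hypothetical list of daily row counts for one table and flags a new count that falls more than a chosen number of standard deviations from the historical mean; the function name, threshold and sample numbers are all illustrative assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Return True if `latest` deviates from `history` by more than
    `threshold` sample standard deviations (a simple z-score check).

    Illustrative only: this is an assumed heuristic, not Monte Carlo's model.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical observations")
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly constant history: any change at all is unexpected.
        return latest != center
    return abs(latest - center) / spread > threshold


# Hypothetical daily row counts for one table in a warehouse.
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

print(is_anomalous(daily_row_counts, 1008))  # a typical day
print(is_anomalous(daily_row_counts, 400))   # a sharp drop in loaded rows
```

A check like this only signals that something looks wrong. Tracing the incident to a source change, a code error or an environment change would still depend on the query logs and lineage information described above.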
With the precise information on what's wrong, Gavish said it's possible for data engineers to fix the root cause. As for fixing data glitches, Monte Carlo is not able to automate that currently. Users must still manually adjust the configuration or data that led to the data problem.
Gavish said that over the next year, Monte Carlo aims to develop capabilities to enable automatic remediation of data pipeline problems.
For Monte Carlo CEO and co-founder Barr Moses, Incident IQ is another step on her company's mission of helping companies benefit more from data by minimizing data downtime.
"Data downtime is a term that we use to describe periods of time when your data is wrong or inaccurate," she said. "We think the problem of data downtime is something that's becoming more and more important for companies that are using data to drive decision-making."