Data confidence a problem data wrangling aims to resolve
Joe Hellerstein, co-founder and CSO of Trifacta, talks about data wrangling and the ongoing trend of migration to the cloud, among other data management topics.
Data confidence is a problem for many organizations.
Enterprises collect massive amounts of data and then put it away in a data warehouse until the time comes when they want to package it and turn it into actionable insights. Often, however, their data confidence is low.
Because there's so much data, some organizations don't know whether the data quality is high or not. And if an organization is lacking data confidence, it may be hesitant to act based on the data.
Data wrangling software aims to address the lack of data confidence. Data wrangling vendors provide platforms that sift through data to find high-quality, relevant information and then transform the data into a digestible format so it can be analyzed.
Trifacta, based in San Francisco and founded in 2012, is one such data wrangling vendor aiming to improve data confidence. Recently, after hearing concerns from its customers about data quality -- particularly with the complication of cloud migration thrown into the mix -- the vendor conducted a survey of 646 data professionals. It released the findings on Jan. 23 in a report, "Obstacles to AI and Analytics Adoption in the Cloud."
Among the key findings, 66% of respondents said all or most of their analytics, augmented intelligence and machine learning initiatives were running in the cloud, but 75% said they aren't confident in the quality of their data. In addition, 46% reported that data inaccuracy is halting AI projects.
Joe Hellerstein, a longtime computer science professor at the University of California, Berkeley, as well as a co-founder and CSO of Trifacta, recently answered a series of questions related to data confidence and data quality.
In Part I of a two-part Q&A, he discussed data wrangling and the cloud migration trend he's seen pick up speed in the last couple of years. In Part II, he talks about Trifacta's report as well as how data wrangling can improve data confidence.
Trifacta recently released its report 'Obstacles to AI & Analytics Adoption in the Cloud.' What led you to conduct the research in the first place, and what problems were you seeing from users of your data wrangling software that motivated the research?
Joe Hellerstein: I think we had an atmospheric sense from our customers that A) there was a lot of activity in the cloud, and we wanted to understand and quantify that, and B) there was a lot of anxiety around data quality, and again we wanted to quantify and localize that. This was an effort to put some numbers around the qualitative feelings we were hearing and get some more quantitative results.
So do the findings imply that there's a lack of data confidence in the cloud?
Hellerstein: I think there's widespread belief that the cloud as a platform is extremely viable, that it can do anything you want it to do except perhaps [obscure mathematics], and some legislative regulation. So, in terms of the capability of the compute of the software, there's high confidence that the cloud is a very good place to do the next generation of work. In the report, I think where you see a lot of concern is around the quality of data in these organizations that would empower an AI-driven approach. There, it's a matter of work to be done. It doesn't so much matter if it's in the cloud or on premises, but the key findings here are that 75% of C-suite respondents are not confident in the quality of their data.
There's a roadmap for getting to a successful machine learning and AI strategy. Part of that roadmap is infrastructural, so a shift from on premises to cloud. Part of it is operational, in terms of the skill sets and processes in the data handling, and that's where this issue comes up. You really need to have a much bigger focus on data quality and assessment of data assets, and the whole data wrangling becomes super critical as you try to transition to being a data-driven business.
How do Trifacta and other data wrangling and data management software improve data confidence at the executive level?
Hellerstein: I think the proof is in the pudding. That proof of success of automated data processing is in metrics around what's being driven by that data. For example, if you're driving recommendations on a retail site, you want to know what the rate of adoption of those recommendations is, and is it improving over time? If you're looking at lead generation, a marketing campaign, you can measure how many of those leads turn into opportunities and you can see if you're improving over time. In a healthy, data-driven organization, you're trying to get on a flywheel of improvement, and those end metrics which are business metrics are what the C-suite cares about. The other thing you can do if the technology is appropriate is you can start to get eyeballs on the data in a way that can actually be communicated to decision-makers where you can start talking about dashboards and data quality in the large as well as in specific data sets, and with the appropriate representations of data quality those can be communicated to a relatively non-technical or strategic audience.
But the bottom line is, are you helping the business, and I think that's where the key issue is.
Is there anything else you'd like to add about either data wrangling, data confidence or cloud migration in general?
Hellerstein: Workloads in the cloud are a bit different than traditional data processing workloads, so that might deserve a little comment.
The first applications for which people want to do data analysis in the cloud are typically applications that are using net new data sources. I think the report speaks of some 90-plus percent of all data being generated in the last year or two, so there are huge volumes of data and a lot of it is coming from log files, it's coming from [the internet of things], it's not necessarily coming through the traditional transactional database. What that means is that this data, on the one hand, is very messy and complicated, and it's changing quickly as the software generates it, and it's changing in an agile way. At the same time, it's the kind of data where you can really extract a signal from noise. Where traditionally we were concerned with a single source of truth in the enterprise for transactional data and wanted to know every line item -- when it was purchased and by whom -- this is data where I'm just trying to get the signal from the noise in order to drive predictive processes like fraud detection, like targeted marketing.
What does that lead to?
Hellerstein: So in this setting there's this interesting coupling of ample, very complex data and noise-tolerant technologies like machine learning that's really powerful and interesting and different from what we used to do. The data wrangling around that becomes much more content-oriented -- am I getting good statistics out of this data -- and a little less persnickety about whether Row 47 of the master table is right, for example.
These are both important trends -- whether you can get the data directionally right in order to do prediction and whether you can get the reference data right. Those are both important, but the emphasis traditionally was only on that exact reference data, and the volume of data that's about directional trends today is much higher and much more strategic.
Editor's note: This Q&A has been edited for clarity and conciseness.