How to navigate the challenges of the data modeling process
Data modeling and curation can help businesses more efficiently use data they've collected. There are challenges, however -- beginning with ensuring data quality.
Enterprises are adopting data modeling and curation practices to bring order to the process of turning data into business value quickly. It sounds easy on paper, but managers should consider several challenges to making the data modeling process work effectively.
At a high level, the biggest challenges include ensuring that the data correlates with the real world and that it can be woven into existing business processes and analytics apps. Business managers frequently underestimate the amount of time it takes to clean data.
Other challenges include data curation and modeling across disparate sources and data stores, as well as ensuring security and governance in the process.
Challenge No. 1: Ensuring data quality
Data quality and the amount of effort required to address data quality issues are usually the biggest challenges in this area, said Ryohei Fujimaki, founder and CEO of dotData, a data science automation platform.
"The traditional process to deal with data quality involves a lot of hard-coded business logic, which makes the data pipeline very difficult to maintain and to scale," he said.
Cleaning and preparing data in an automated way typically requires a significant upfront investment in data engineering to improve the data sources, data transportation and data quality. These efforts can be stymied when managers don't know the potential business value of a project.
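One way to picture the alternative to hard-coded business logic is to keep quality rules in a table that can be edited without touching the pipeline code. The following is a minimal sketch under that assumption, not a description of any vendor's product; the field names (`customer_id`, `amount`) and the rules themselves are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical rule table: each entry pairs a rule name with a check on one record.
Rule = Tuple[str, Callable[[dict], bool]]

RULES: List[Rule] = [
    ("customer_id present", lambda r: bool(r.get("customer_id"))),
    (
        "amount is non-negative",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
]


def failed_rules(record: dict, rules: List[Rule]) -> List[str]:
    """Return the names of the rules that this record violates."""
    return [name for name, check in rules if not check(record)]


def quality_report(records: List[dict], rules: List[Rule]) -> Dict[str, int]:
    """Count how many records violate each rule."""
    report = {name: 0 for name, _ in rules}
    for record in records:
        for name in failed_rules(record, rules):
            report[name] += 1
    return report


# Illustrative records only; not real data.
SAMPLE = [
    {"customer_id": "C1", "amount": 10.0},
    {"customer_id": "", "amount": -5},
    {"amount": 3},
]

if __name__ == "__main__":
    print(quality_report(SAMPLE, RULES))
```

Adding or retiring a rule then means editing `RULES` rather than rewriting the loop, which is one way to keep such a pipeline easier to maintain as requirements change.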
Challenge No. 2: Identifying contributors to dirty data
If the data at the source is not accurate, everything in the rest of the data modeling process that's based on that data will topple like dominoes.
"All of the decisions or insights generated will be exponentially inaccurate, and this is what most businesses are facing today," said Kuldip Pabla, senior vice president of engineering at K4Connect, a technology platform for seniors and individuals living with disabilities.
Data inaccuracy could creep in during the creation and acquisition of data, during the cleaning process or even when annotating data. For instance, in the healthcare or elderly markets, inaccurate health data could be the deciding factor between life and death. A wrong decision based on inaccurate data could lead to severe consequences, Pabla said.
Challenge No. 3: Enabling fitness for purpose
Fitness for purpose means the data is correct and trustworthy for its intended uses according to the rules that govern it. According to Justin Makeig, director of product management at MarkLogic, an operational database provider, "The real difficulty is that correct and trustworthy are very contextual. They vary [based] on the data and how it's being used."
For example, a health insurance provider might use the same data about patients to analyze the quality of care and to perform billing. Those are very different concerns with very different rules, even if they both use the same or overlapping data.
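The insurance example can be sketched as one record checked against a separate rule set for each use. This is an illustration only: the field names and the required-field lists below are assumptions made for the sketch, not rules drawn from any real insurer or from MarkLogic.

```python
# Hypothetical patient record; every field name here is an assumption.
PATIENT = {
    "patient_id": "P-001",
    "diagnosis_code": "E11.9",
    "insurance_id": "INS-42",
    "billing_address": None,  # missing in this example
}

# Hypothetical required fields for each intended use of the same data.
REQUIRED_FIELDS = {
    "quality_of_care": ["patient_id", "diagnosis_code"],
    "billing": ["patient_id", "insurance_id", "billing_address"],
}


def missing_fields(record: dict, purpose: str) -> list:
    """List the fields that this purpose requires but the record lacks."""
    return [
        field
        for field in REQUIRED_FIELDS[purpose]
        if record.get(field) in (None, "")
    ]


def fit_for(record: dict, purpose: str) -> bool:
    """A record is fit for a purpose when none of its required fields is missing."""
    return not missing_fields(record, purpose)


if __name__ == "__main__":
    for purpose in REQUIRED_FIELDS:
        print(purpose, fit_for(PATIENT, purpose), missing_fields(PATIENT, purpose))
```

In this sketch the same record passes the quality-of-care check and fails the billing check, which is the contextual sense of "correct and trustworthy" that Makeig describes.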
Challenge No. 4: Breaking down data silos
Another challenge to the data modeling process is siloed systems, which often run on legacy architectures and undermine the effectiveness of predictive analytics tools. In addition, newer data sources, such as third-party apps, premium data sources, and self-service BI and machine learning curated data sets, are growing in importance.
"Knowing what data you have available [and] efficiently empowering data-driven cultures of self-service users is no easy feat," said Jen Underwood, senior director at DataRobot, an automated machine learning platform.
Privacy regulations also create new obstacles to bringing these data silos together efficiently.
Challenge No. 5: Avoiding the Las Vegas effect
Analytics users can also inadvertently create new data silos in the cloud when they use SaaS-specific analytics platforms like Salesforce Einstein to curate and model data sets outside of centralized data engineering efforts. Organizations need to strike a balance between the benefits of ad hoc analytics on these platforms and making it easier to adopt centralized data management practices.
"It's the Las Vegas effect: What happens in Vegas, stays in Vegas," said Jean-Michel Franco, senior director of data governance products at Talend.
For example, a business user might use Tableau to fix errors in data that comes from Salesforce. Although this fixes the user's immediate problem, it also creates a new data silo, because the correction never reaches the source system.
Overall, this approach is also inefficient. Research from IDC shows that a typical data professional spends only 19% of their time analyzing data, while 81% is dedicated to finding data, preparing it to meet requirements, and protecting or sharing it.
Challenge No. 6: Starting fresh or cleaning old data
Enterprises need to confront a sunk cost bias when deciding on a curation and data modeling process. Many enterprises have assembled massive stores of data without figuring out how that data fits into the business process.
Managers may feel a sense of loss when data engineers suggest that throwing this data out and starting over could reduce the cost and effort required to clean and model the data for specific goals.
"There is always a tradeoff between taking in more data and cleaning the data you already have," said Josh Jones, manager of analytics at Aspirent, an analytics consulting service. There are also often important negotiations between managers about who owns the data and who is responsible for cleaning it.
"Often, one group does not want to take on all the work," Jones said.
Different business units may also have different requirements for how they want to clean the data.
Challenge No. 7: Understanding what the business cares about
"The biggest challenge to data curation and modeling remains the same: Does the business care?" said Goutham Belliappa, vice president of AI engineering at Capgemini. If a business cannot see the value of data, then it will have no motivation to curate it.
Excel remains the No. 1 BI tool in most organizations, which is a fundamental demonstration that quality is secondary to immediacy, he said. Curation needs to add value to the data modeling process.
"What is bad data for one person could be good data for another," Belliappa said. For instance, sales data that records customer details inaccurately is clearly bad for the sales team. But this same data might be ideal for implementing AI to identify challenges in the current sales process.
"Too often, people try and curate toward a single perspective, leading to other areas of the business disengaging," Belliappa said.