Data modeling software tackles glut of new data sources
Data modeling platforms are starting to incorporate features to automate data-handling processes, but IT must still address entity resolution, data normalization and governance.
Data modeling, a key component of data management techniques and analytics processes, comprises many complex steps that are getting increasingly difficult as new sources of data are added.
Traditional approaches to analytics -- sometimes known as passive analytics -- were established before the rise of big data platforms and IoT and the growing mainstream adoption of technologies like AI, predictive analytics and location intelligence. "Today, businesses are left with assorted analytics technologies that struggle to align and apply advanced analytical techniques effectively," said Nima Negahban, CTO and co-founder of distributed database management system provider Kinetica.
Although data modeling software is starting to incorporate features that help automate some of the process of organizing and preparing all the incoming data for analysis, IT managers still need to develop a plan to address common problems, such as entity resolution, data normalization and data governance.
It's quite common for multiple data records with heterogeneous schemas to describe the same entity. As companies expand their data sets, it can be increasingly difficult to correlate data across silos that relates to a single entity, like a person, product or business. But that's a crucial step when doing analytics in applications such as making sense of health data, targeted marketing and customer experience optimization.
Consolidating all the relevant data records into a central profile that describes a single entity is a challenge, since ID linking and schema reconciliation are difficult processes, said Mingxi Wu, vice president of engineering at graph analytics platform maker TigerGraph. Data modeling software is just now starting to implement some basic techniques to make it easier to query across silos.
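The ID linking Wu describes can be illustrated with a minimal sketch: below, records from two hypothetical silos (a CRM and a support system, with invented field names) are merged into one profile when an exact email match or a fuzzy name match links them. Real entity-resolution tools use far more sophisticated matching, so treat this only as an illustration of the idea.

```python
# Minimal entity-resolution sketch: link records from two silos that
# describe the same person despite different schemas and key formats.
# All field names and matching rules here are illustrative assumptions.
from difflib import SequenceMatcher

crm_records = [
    {"cust_id": "C-101", "full_name": "Jane A. Doe", "email": "JDOE@EXAMPLE.COM"},
]
support_records = [
    {"ticket_user": "jane doe", "contact": "jdoe@example.com"},
]

def normalize_email(value):
    # Schema reconciliation step: emails differ only by case/whitespace.
    return value.strip().lower()

def name_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link(crm, support, threshold=0.6):
    """Merge records into one central profile when the email matches
    exactly or the names are sufficiently similar."""
    profiles = []
    for c in crm:
        profile = dict(c)
        for s in support:
            same_email = normalize_email(c["email"]) == normalize_email(s["contact"])
            similar_name = name_similarity(c["full_name"], s["ticket_user"]) >= threshold
            if same_email or similar_name:
                profile.setdefault("tickets", []).append(s)
        profiles.append(profile)
    return profiles

merged = link(crm_records, support_records)
print(merged[0]["cust_id"], len(merged[0]["tickets"]))  # C-101 1
```

The exact-key match (normalized email) is cheap and reliable; the fuzzy name match catches records where no shared key exists, at the cost of possible false positives, which is why production systems tune thresholds per attribute.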
Normalizing data sets
Information is stored in a variety of applications and business systems, requiring a range of data collection processes to unify data from all these sources, said Rakesh Jayaprakash, product manager at ManageEngine, the IT management software division of Zoho Corp. A database can be directly queried for information as opposed to the more complex task of extracting data from a business application. One of the biggest challenges is to make sure data formats are normalized to fit into an organized structure before they're used to run analytics.
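The normalization step Jayaprakash mentions can be sketched as a per-source mapping onto one common schema. The source names and field formats below are assumptions for illustration, not from any particular product.

```python
# Minimal normalization sketch: coerce records from different business
# systems into one schema before analytics. Source formats are invented.
from datetime import datetime

def normalize_record(raw, source):
    """Map a source-specific record onto a common schema with
    ISO-8601 dates and float amounts."""
    if source == "erp":
        # ERP export uses US-style dates and stringified totals.
        return {
            "order_date": datetime.strptime(raw["OrderDate"], "%m/%d/%Y").date().isoformat(),
            "amount": float(raw["Total"]),
        }
    if source == "webshop":
        # Webshop API already emits ISO timestamps, but amounts in cents.
        return {
            "order_date": raw["created_at"][:10],
            "amount": raw["amount_cents"] / 100.0,
        }
    raise ValueError(f"unknown source: {source}")

rows = [
    normalize_record({"OrderDate": "03/15/2024", "Total": "19.99"}, "erp"),
    normalize_record({"created_at": "2024-03-16T08:30:00Z", "amount_cents": 2500}, "webshop"),
]
print(rows)  # both rows now share one schema
```

Once every source funnels through a mapping like this, downstream analytics can treat the rows uniformly regardless of where they originated.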
One approach is to appoint data curators in different departments who can ensure the validity and relevance of the collected data, Jayaprakash suggested. Since BI and analytics users largely depend on structured data, he said a good practice is to structure data at the source -- setting up cleansing processes where the data is generated to prepare it for consumption by analytics platforms or storage in data warehouses.
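Structuring data at the source, as Jayaprakash recommends, amounts to validating and standardizing records where they're generated rather than in the warehouse. A minimal sketch, with invented fields and rules:

```python
# Sketch of cleansing at the source: validate and standardize records
# before they leave the generating system. Rules are assumptions.
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def cleanse(record):
    """Return a cleaned record, or None if it fails validation."""
    email = record.get("email", "").strip().lower()
    if not EMAIL_RE.match(email):
        return None  # reject rather than ship bad data downstream
    return {
        "email": email,
        "country": record.get("country", "").strip().upper() or "UNKNOWN",
    }

raw = [
    {"email": " Jane@Example.com ", "country": "us"},
    {"email": "not-an-email", "country": "DE"},
]
clean = [c for r in raw if (c := cleanse(r)) is not None]
print(clean)  # [{'email': 'jane@example.com', 'country': 'US'}]
```

Rejected records would in practice be routed to the department's data curator for review rather than silently dropped.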
To mine relevant data, companies need to have a clear purpose for running analytics. Otherwise, useful information may be buried under heaps of irrelevant data. Deploying data curation and modeling tools to gather data from different sources without a lot of tweaking under the hood can help improve the mining process, Jayaprakash said.
Building a data modeling framework
Another benefit of data modeling software and curation tools is improved self-service analytics, but caution and good governance need to be exercised when it comes to accessing these models. "Horror stories abound about contradictory analyses built on unvetted sources of data leading to flawed decisions," said John Hagerty, vice president of product management for business analytics at Oracle. Yet, overprotection of data sources might force business users to seek alternate routes to access the data.
Hagerty recommended curating a list of trusted data sources and making them available through security privileges granted by data stewards. Surprisingly, many of these sources are hidden from view. Also, when data analysts enrich data with more contextual information, there should be a way to mark these sources as sanctioned so they can be used by others. It's also good to discourage people from spinning up slightly different versions of data, he noted.
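The curated list of trusted sources that Hagerty describes can be pictured as a small catalog in which stewards grant access and flag enriched data sets as sanctioned. The structures and names below are illustrative, not any vendor's API:

```python
# Sketch of a trusted-source catalog: data stewards grant access via
# roles and mark vetted data sets as sanctioned. Names are invented.
catalog = {
    "sales_orders": {"sanctioned": True, "steward": "dana",
                     "allowed_roles": {"analyst", "finance"}},
    "ad_hoc_export": {"sanctioned": False, "steward": "dana",
                      "allowed_roles": set()},
}

def accessible_sources(role, require_sanctioned=True):
    """List the data sources a role may query, optionally only vetted ones."""
    return sorted(
        name for name, meta in catalog.items()
        if role in meta["allowed_roles"]
        and (meta["sanctioned"] or not require_sanctioned)
    )

def sanction(name, steward):
    """The assigned steward marks an enriched data set as vetted for reuse."""
    if catalog[name]["steward"] != steward:
        raise PermissionError("only the assigned steward can sanction")
    catalog[name]["sanctioned"] = True

print(accessible_sources("analyst"))  # ['sales_orders']
```

Defaulting `require_sanctioned` to true reflects Hagerty's point: unvetted sources stay invisible unless someone deliberately opts in, which discourages slightly different copies of the same data from proliferating.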
Data preparation capabilities embedded in tools can help refine the analytics experience. Enrichment recommendation engines can use AI to suggest ways to augment data sets. It's likewise important to select tools that can help catalog data in a way that keeps track of its provenance.
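Cataloging provenance, as mentioned above, means recording which inputs and transformations produced each derived data set so its lineage can be audited. A minimal sketch, with hypothetical data set names:

```python
# Sketch of lineage tracking: record where each derived data set came
# from so its provenance can be traced back. Names are illustrative.
from datetime import datetime, timezone

lineage = {}

def register(dataset, sources, transform):
    """Record a derived data set's inputs and the step that produced it."""
    lineage[dataset] = {
        "sources": list(sources),
        "transform": transform,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

def provenance(dataset):
    """Walk the lineage graph back to the raw inputs."""
    entry = lineage.get(dataset)
    if entry is None:
        return [dataset]  # a raw source with no recorded parents
    chain = []
    for src in entry["sources"]:
        chain.extend(provenance(src))
    return chain

register("clean_orders", ["raw_orders"], "normalize dates and amounts")
register("revenue_report", ["clean_orders", "fx_rates"], "join and aggregate")
print(provenance("revenue_report"))  # ['raw_orders', 'fx_rates']
```

With lineage recorded at every transformation, an analyst questioning a number in `revenue_report` can trace it back to the raw systems it came from.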
Data modeling and curation is ultimately an iterative process that requires communication between the IT teams capturing the data, the data engineers modeling it, the business users looking for new self-service analytics and the data scientists working to create machine learning models. "Instead of the traditional Waterfall methodology for analytics, data teams should follow in the steps of software development and adopt a much more Agile methodology," said Doug Bordonaro, field CTO at BI analytics tools provider ThoughtSpot. Data management tools are starting to include features that improve interaction with end users throughout the curation process.