putilov_denis - stock.adobe.com

NLP and AI boost the automated data warehouse

Businesses are working to automate as many elements of their data warehouses as they can through nascent tools like augmented analytics and natural language processing.

As digital business continues to accelerate, enterprises are automating elements of their data warehouses to accelerate their data-to-insights cycles with the help of AI and machine learning. Augmented analytics plays a role, as do traditional tools such as ETL (extract, transform and load). Collectively, the landscape of increasingly intelligent data management tools helps make data more accessible and usable.

The augmented analytics influence

Augmented analytics is the current state of the art when it comes to data analytics. Instead of typing queries in SQL, users can simply use natural language.

Another distinguishing factor of augmented analytics platforms has extended past analytics to include data preparation and even some data warehouse capabilities. According to Mark Beyer, research vice president and analyst at Gartner, the role of augmented analytics is to discover data usage patterns that determine who is accessing what data, how often, in which combinations and the rate of acceleration or deceleration of their overall engagement.

"Augmented analytics can only learn from patterns and previous activity. They can add data analysis at the content profiling level by individual assets and by inferring that similar data in different data sets might be the same data," Beyer said. "Any inference model would have to be trained to recognize long-term patterns, requiring both time and many use cases interfacing with the same data to show how variable the patterns are and what conditional scenarios drive the different variations."

Augmented analytics platform provider Qlik offers a suite of data management tools that are packaged in a single SKU. Qlik Replicate, a universal data replication and ingestion tool, integrates with Qlik Compose, a data lake and automation tool, to enable and automate batch and real-time data feeds from source systems to data warehouses and lakes.

Qlik Enterprise Manager centrally manages data replication and pipeline automation across the enterprise, providing a single point of control for designing, executing and monitoring Replicate and Compose tasks.

The resulting data structures and metadata are shared with Qlik Catalog so users can provision data directly from Catalog into the Qlik Sense augmented analytics platform or similar platforms like Power BI and Tableau.

"Qlik enables batch and continuous migration of data across many data sources and targets both on premises and in the cloud," said Anand Rao, product marketing director at Qlik. "[It] supports use cases ranging from cloud migrations to platform modernization and integrates closely with all major cloud vendors."

Augmented analytics platform provider Sisense offers a full suite of data management capabilities including ingest, manual and AI preparation, modeling, governance and cataloging. Any of those capabilities can be swapped out for best-of-breed services that may be more specialized in a designated area.

ML-based data preparation is easily the biggest trend that we see in the space. The amount of time people spend combing through tables to perform tasks as simple as deduplications is stunning and can be automated away.
Ryan SegarSenior vice president of field engineering, Sisense

"One of the most unique things about Sisense is that we architected [it] as a true microservices solution, [so] every workflow can be supplemented or swapped out entirely," said Ryan Segar, senior vice president of field engineering at Sisense.

For example, with ETL, customers can use Stitch, Fivetran, CData or Matillion. For data warehouses or data lakes, they can use Redshift, Snowflake, SingleStore, Databricks or BigQuery. For governance and cataloging, they can use Collibra, Alation, BigID, Alteryx, Trifacta and others.

"ML-based data preparation is easily the biggest trend that we see in the space," Segar said. "The amount of time people spend combing through tables to perform tasks as simple as deduplications is stunning and can be automated away."

Natural language processing moves to the forefront

Natural language processing (NLP) was added to data analytics platforms so less technical users, such as "citizen data scientists," can access and analyze data.

"NLP understands the user's intent and parses search strings to identify the key attributes of an analytical query. [It] then leverages AI to generate the best insights for the user, which can be refined and added to dashboards for further explanation," said Rao. "Similarly, queries to an automated data warehouse can benefit from NLP, allowing business analysts to request data and analytics calculations without sophisticated SQL queries."

Rao defines data warehouse automation as creating and importing data models, performing custom mappings of data types across different data stores, adhering to data validation and quality rules and creating data warehouses or derived data marts. An NLP-driven query generator tool can initially be used to update individual tasks within a workflow and ultimately replace the less complex, downstream tasks.

A chart comparing data warehouse and data marts.
A data warehouse vs. data mart.

"The NLP-driven tool must be capable of generating typical OLAP [online analytical processing] queries to create data marts with the requested data sets," Rao said.

While the semantic layer has made self-serve data accessible, the growing amount and types of data have revealed its fatal flaw, according to Segar.

"Humans are still the ones creating and maintaining it. NLP has been driven by advancements in data cataloging, and ML changes the game so systems can retrain themselves on business words being used across global and local cases," Segar said. "When implemented properly, we can automate the most difficult task in data management: recognizing the uniqueness of each user rather than training them to think differently."

If the language is analyzed for consistent use over many occurrences and has a subject, object, predicate (SOP) construct, then it can be parsed for code inputs, according to Gartner's Beyer. For instance, computing code always has the same SOP construct:

  • Subjects are derived from a department, business unit or task functional requirement such as "patient admission."
  • Objects are the desired attributes or memory arrays to be populated, such as "date and time for patient admitting."
  • Predicates are verb phrases in the language, such as "admitted patient."

"The NLP can thus be coded as a program module for capturing patient admission as the subject. The objects are the patient identifier, hospital, ward identifier (if applicable) and date and time at which the patient is logged as present. The predicate is capture data inputs," Beyer said. "On the back end, an augmented system might know how that data is used based on previous use cases. It could also learn the typical types of errors and create controls for screening them using data quality or query plan rules as a second predicate."

The bottom line is that data warehouses continue to become more automated over time with the help of AI and machine learning. Augmented analytics is part of the fabric that helps drive value from data by empowering more people to glean important contextual insights.

Dig Deeper on Data warehousing