Solve real-time analytics challenges across operational and data lake data

GridGain Systems

Digital transformation unleashes massive amounts of data. A typical organization struggles when it adopts tools, such as those involving IoT, that require real-time analysis of the data it collects. Driving real-time business processes requires accelerating both operational and data lake workloads. This can be accomplished by using in-memory computing as a hybrid transactional/analytical processing platform that can run federated queries across company-wide data.

Today, companies typically store a copy of their operational data in a data lake, often built on Hadoop, where it is available for later analysis. However, for a growing number of digital transformations and omnichannel customer experiences, companies find they must run real-time analytics across their operational data set and a subset of the data in their data lake. For traditional infrastructures, real-time analytics proves challenging because of delays in accessing and processing data in a data lake and the difficulty of running federated queries across operational and archived data. Mature in-memory computing (IMC) technology can help resolve these obstacles. IMC platforms offer real-time performance and massive scalability, include built-in integrations with popular data platforms, and can run real-time analytics across operational and data lake data sets.

Consider the following situation: A copier company has tens of thousands of connected copiers deployed at thousands of locations. Equipped with many real-time sensors, the copiers are IoT devices sending massive amounts of data back to the company’s IoT platform, which ingests, processes and analyzes the device data. To increase asset utilization, the company uses predictive maintenance to service machines before they fail. To do this, the company must be able to run real-time analytics on the streaming operational data and, if a potential problem is flagged, compare the operational data with that copier’s historical data in the data lake. The organization can then use the data to troubleshoot the issue for a specific copier in depth and respond in real time.

Performing analyses across streaming operational big data and data stored in a data lake requires hybrid transactional/analytical processing (HTAP) that can ingest and process the operational data while accessing hot tables in the data lake for federated queries. IMC platforms have emerged as the critical component to make that possible.
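As a rough illustration of what a federated query in the copier scenario might look like, the sketch below joins a table of live readings with a table of historical readings in a single SQL statement. It is only a stand-in: two SQLite in-memory databases play the roles of the operational store and the data lake, and the table names, column names, copier IDs, sample values and threshold are all assumptions made for the example, not part of any IMC product.

```python
import sqlite3

# One in-memory database stands in for the operational store.
conn = sqlite3.connect(":memory:")
# A second in-memory database, attached under the hypothetical name "lake",
# stands in for hot tables loaded from the data lake.
conn.execute("ATTACH DATABASE ':memory:' AS lake")

conn.execute(
    "CREATE TABLE main.live_readings (copier_id TEXT, ts INTEGER, fuser_temp REAL)"
)
conn.execute(
    "CREATE TABLE lake.history (copier_id TEXT, ts INTEGER, fuser_temp REAL)"
)

# Made-up historical readings per copier.
conn.executemany(
    "INSERT INTO lake.history VALUES (?, ?, ?)",
    [
        ("c-17", 1, 180.0), ("c-17", 2, 182.0), ("c-17", 3, 181.0),
        ("c-42", 1, 179.0), ("c-42", 2, 180.0),
    ],
)
# Made-up live readings arriving from the devices.
conn.executemany(
    "INSERT INTO main.live_readings VALUES (?, ?, ?)",
    [("c-17", 100, 214.0), ("c-42", 100, 180.5)],
)


def flagged_copiers(threshold=20.0):
    """Return live readings that exceed the copier's historical average
    by more than `threshold`, using one query across both data sets."""
    sql = """
        SELECT l.copier_id, l.fuser_temp, AVG(h.fuser_temp) AS hist_avg
        FROM main.live_readings AS l
        JOIN lake.history AS h ON h.copier_id = l.copier_id
        GROUP BY l.copier_id, l.ts, l.fuser_temp
        HAVING l.fuser_temp - AVG(h.fuser_temp) > ?
    """
    return conn.execute(sql, (threshold,)).fetchall()


print(flagged_copiers())
```

With the sample rows above, only copier c-17 is returned, because its live reading sits well above its own historical average. In a production deployment the two sides of the join would live in different systems, which is the part an IMC platform is meant to make fast.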

In-memory computing capabilities today

IMC ingests, processes and analyzes operational data with real-time performance and scalability to petabytes of in-memory data. Today’s IMC platforms include some or all of these capabilities:

An in-memory data grid (IMDG) or in-memory database (IMDB): Deployed on a cluster of servers on-premises, in the cloud or both, IMDGs and IMDBs pool the cluster's available memory and compute and keep all data in memory, which eliminates constant disk reads and writes. An IMDG is deployed atop an existing database and keeps the underlying database in sync. An IMDB holds data in memory for processing, with all data written to disk for backup and recovery. The IMDB can also process against the disk-based data set, which allows fast restarts and lets operators trade application speed against infrastructure cost by keeping only some of the data in memory.

HTAP or hybrid operational/analytical processing capabilities: Systems maintain a single data set on which transactional and analytical processing run simultaneously. This eliminates the costly and time-consuming extract, transform, load (ETL) process required to move data from a dedicated online transaction processing infrastructure to a separate online analytical processing infrastructure.

Streaming data processing capabilities: These capabilities manage the complexity of moving data, enabling IMC platforms to rapidly ingest, transact on and analyze high-volume data streams with real-time performance.

Machine learning and deep learning capabilities: IMC platforms that incorporate machine learning libraries offer what Gartner refers to as in-process HTAP, in which machine learning models are updated in real time based on operational data. IMC platforms that incorporate native integrations with deep learning platforms, such as TensorFlow, can dramatically decrease the cost and complexity of transferring data to deep learning training platforms and updating deep learning models following training.
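To make the IMDG idea from the list above more concrete, here is a minimal single-process sketch of an in-memory layer that sits atop an existing database and keeps it in sync through write-through updates. The class name, the `kv` table and the `db_reads` counter are hypothetical, and SQLite stands in for the underlying database. A real data grid distributes data across a cluster and handles concurrency, which this toy does not attempt.

```python
import sqlite3


class WriteThroughGrid:
    """Toy in-memory layer in front of a database (hypothetical, single node)."""

    def __init__(self, db):
        self.db = db
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)"
        )
        self.memory = {}    # the in-memory copy that serves reads
        self.db_reads = 0   # counts how often a read had to go to the database

    def put(self, key, value):
        # Write-through: update the underlying database first, then memory,
        # so the two stay in sync.
        self.db.execute(
            "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value)
        )
        self.db.commit()
        self.memory[key] = value

    def get(self, key):
        # Serve from memory when possible; fall back to the database once
        # and remember the result.
        if key in self.memory:
            return self.memory[key]
        self.db_reads += 1
        row = self.db.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)
        ).fetchone()
        if row is None:
            return None
        self.memory[key] = row[0]
        return row[0]
```

After the first `get` for a key, later reads of that key never touch the database, which is the effect the capability description attributes to keeping data in memory.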

Some IMC platforms use built-in integrations to connect with popular streaming data platforms, such as Apache Kafka, and with data processing tools, such as Apache Spark, which can in turn connect to Apache Hadoop.

Apache Kafka: Kafka is used to build the data pipelines and streaming apps that process incoming data in real time.
Apache Spark: Spark is a unified analytics engine for large-scale data processing. It can power federated queries and transfer data from a Hadoop-based data lake to an operational data store.
Apache Hadoop: Hadoop includes a distributed file system that provides high-throughput access to application data.

A new infrastructure for real-time analytics across operational and data lake data

Consider another example: An airline is collecting a continuous stream of data from its airplane engines. The data is being ingested, processed, analyzed and then stored in a data lake, with only the most recent data retained in the operational data store. Suddenly, an anomalous reading in the live data triggers an alert for a particular engine. To identify the root cause of the problem, the system needs to analyze the most recent engine data, which is in the operational data store, along with all historical data for that engine, which is stored in the data lake.

The airline’s new infrastructure, powered by an IMC platform, Kafka, Spark and Hadoop, makes this possible. Kafka feeds the live streaming data to the IMC platform and to the Hadoop data lake. Spark retrieves required data from the data lake and delivers it to the IMC platform. The IMC platform maintains the combined data set in memory and runs real-time queries across the combined data set. The result is deep and immediate insight into the causes of the anomalous reading.
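The data flow described above can be sketched in a few lines of plain Python. Nothing here calls Kafka, Spark or Hadoop: `ingest` merely stands in for the Kafka feed, `load_history` for the Spark job, and two dictionaries for the data lake and the in-memory operational store. The function names, the retention window of three readings, the z-score check and its limit are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

data_lake = defaultdict(list)     # stand-in for Hadoop: full history per engine
operational = defaultdict(list)   # stand-in for the IMC store: recent readings
RECENT = 3                        # hypothetical retention window


def ingest(engine_id, value):
    """Stand-in for the Kafka feed: each reading goes to both stores."""
    data_lake[engine_id].append(value)
    operational[engine_id].append(value)
    # Retain only the most recent readings in the operational store.
    del operational[engine_id][:-RECENT]


def load_history(engine_id):
    """Stand-in for the Spark job: copy one engine's history from the lake."""
    return list(data_lake[engine_id])


def investigate(engine_id, limit=3.0):
    """Compare the latest live reading with the engine's earlier history."""
    latest = operational[engine_id][-1]
    history = load_history(engine_id)[:-1]   # everything before the latest
    if len(history) < 2:
        # Not enough history to judge; report the reading as unremarkable.
        return {"engine": engine_id, "latest": latest, "z": 0.0, "anomalous": False}
    mu, sigma = mean(history), pstdev(history)
    z = (latest - mu) / sigma if sigma else 0.0
    return {"engine": engine_id, "latest": latest, "z": z, "anomalous": abs(z) > limit}


if __name__ == "__main__":
    for reading in [600, 602, 598, 601, 599, 700]:
        ingest("engine-1", reading)
    print(investigate("engine-1"))
```

The point of the sketch is the shape of the flow: one feed writes to both stores, history is pulled in on demand, and the comparison runs over the combined data in memory.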

The ability to run real-time analytics on operational data and a subset of data lake data can power a new era of real-time services and business decision-making. Examples include predictive maintenance services and faster reaction to data anomalies, both of which can increase asset utilization and ROI. These capabilities help companies improve the design of their products and services, including IoT platforms, and create new real-time uses.

All IoT Agenda network contributors are responsible for the content and accuracy of their posts. Opinions are of the writers and do not necessarily convey the thoughts of IoT Agenda.