Build a data streaming, AI and machine learning platform for IoT

Today’s IoT use cases increasingly depend on performing analytics or updating machine learning algorithms in real time on huge amounts of device-generated data. If the data for patient monitoring, autonomous vehicles or predictive maintenance applications isn’t ingested, processed and acted upon in real time, patients suffer, vehicles crash or systems fail. So how can businesses cost-effectively build a reliable platform for ingesting and responding to massive amounts of data at scale? Businesses can do so with a streaming platform and data storage system built on an open source software stack.

Many of today’s open source solutions have proven to be reliable across thousands of production deployments. Many are available with enterprise-grade support and consulting services from commercial businesses, which might also offer enterprise-grade versions of the solutions. These supported solutions enable businesses to achieve their digital transformation goals by implementing IoT solutions without significant upfront costs, while also providing their companies with dependable, future-proofed infrastructure. Below is a sampling of open source solutions that are foundational to many of today’s most successful digitally transformed businesses.

Streaming data

An open source streaming solution, such as Apache Kafka or Apache Flink, is used to build a real-time data pipeline that moves data across the systems and applications within an IoT deployment. For example, in a patient monitoring use case, the streaming solution would deliver the data collected by the IoT sensors attached to a patient to a platform where the data can be aggregated, analyzed and stored.
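A hedged, pure-Python sketch of the produce/consume pattern such a pipeline rests on. The TopicLog class, topic name and record fields are illustrative stand-ins, not a Kafka client API; a real deployment would use a Kafka or Flink client library against a running cluster:

```python
import json
from collections import defaultdict

# Minimal in-memory stand-in for a Kafka-style topic log (illustrative only).
class TopicLog:
    def __init__(self):
        self._topics = defaultdict(list)  # topic -> append-only list of records

    def produce(self, topic, key, value):
        # Records are appended in arrival order, as in a Kafka partition log.
        self._topics[topic].append((key, json.dumps(value)))

    def consume(self, topic, offset=0):
        # Each consumer tracks its own offset into the log.
        return self._topics[topic][offset:]

# Patient-monitoring example: sensors produce readings, a downstream
# aggregator consumes them for analysis.
broker = TopicLog()
broker.produce("patient-vitals", "patient-42", {"heart_rate": 71})
broker.produce("patient-vitals", "patient-42", {"heart_rate": 74})

readings = [json.loads(v) for _, v in broker.consume("patient-vitals")]
avg_hr = sum(r["heart_rate"] for r in readings) / len(readings)
```

The decoupling shown here is the point: producers (sensors) and consumers (analytics) share only the topic, so either side can be scaled or replaced independently.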

Kafka is used in production by Box, LinkedIn, Netflix, Oracle and Twitter. Flink is used in production at Alibaba, AWS, Capital One, eBay and Lyft. However, for the streaming solution to support real-time business processes at scale, it must be integrated with other technologies, including a distributed in-memory computing platform, a container management solution, and analytics and machine learning capabilities.

In-memory computing

Apache Ignite is a distributed in-memory computing platform deployed on a cluster of commodity servers. It can be used as an in-memory data grid inserted between an existing application and a disk-based database, or as a standalone in-memory database for new applications. Ignite pools the available CPUs and RAM of the cluster and distributes data and compute to the individual nodes. It can be deployed on premises, in a public or private cloud, or in a hybrid environment. Ignite supports ANSI-99 SQL and ACID transactions.

Ignite can ingest massive amounts of data in real time. With all data remaining in memory, Ignite uses MapReduce to execute massively parallel processing (MPP) across the distributed cluster. Leveraging both in-memory data caching and MPP, Ignite provides up to a 1,000x increase in application performance at scale versus applications that use a disk-based database. Ignite users can also leverage the native Kafka integration to make it easy to ingest streaming data from IoT devices into the in-memory computing cluster.
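The map/reduce pattern described above can be sketched in a few lines: data lives in per-node partitions, a map task computes a partial result against each partition where the data resides, and the small partial results are reduced into a final answer. This is an illustrative stand-in in which threads simulate cluster nodes, not the Ignite compute API:

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands in for one node's in-memory data partition.
partitions = [
    [3.1, 4.5, 2.2],   # "node 1"
    [5.0, 1.8],        # "node 2"
    [2.9, 3.3, 4.0],   # "node 3"
]

def local_sum_count(partition):
    # The "map" phase: runs where the data lives, so only a tiny
    # partial result crosses the network, never the raw data.
    return sum(partition), len(partition)

with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partials = list(pool.map(local_sum_count, partitions))

# The "reduce" phase: combine the partial results into the final answer.
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
mean = total / count
```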

As I’ve discussed in a previous article, Ignite can be used to build a digital integration hub (DIH) for aggregating and processing data from multiple on-premises datastores, cloud-based data sources and streaming data feeds. As a DIH, Ignite provides a high-performance data access layer that makes the aggregated data available to multiple business applications in real time. Apache Ignite is used in production at American Airlines, IBM, ING and 24 Hour Fitness.


Cluster management

Kubernetes automates the deployment and management of applications that have been containerized with Docker or another container solution. Container solutions package an application together with its dependencies and runtime libraries, sharing the host operating system kernel rather than bundling a full virtualized OS. This packaging enables multiple, completely independent instances of the application to run on the same hardware or across virtualized hardware, such as on a cloud service. Kubernetes makes it easier to manage Docker containers and ensures consistency across a server cluster, which can be deployed in any location: on premises, in a public or private cloud, or in a hybrid environment.
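As a rough illustration of what containerized deployment looks like in practice, the manifest below sketches a Kubernetes Deployment running a small Ignite cluster. The image tag, names, ports and replica count are assumptions for illustration, not values from this article:

```yaml
# Illustrative sketch only; names, image tag and replica count are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ignite-cluster
spec:
  replicas: 3                  # one containerized Ignite node per replica
  selector:
    matchLabels:
      app: ignite
  template:
    metadata:
      labels:
        app: ignite
    spec:
      containers:
      - name: ignite
        image: apacheignite/ignite:2.16.0
        ports:
        - containerPort: 10800   # thin-client connections
        - containerPort: 47100   # node-to-node communication
```

Scaling the cluster then becomes a one-line change to `replicas`, which Kubernetes reconciles automatically.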

Through its APIs, Kubernetes can manage both the Apache Ignite and streaming platform resources and automatically scale the IoT in-memory computing cluster. This ease of management can dramatically reduce complexity and errors and shorten development time. Kubernetes is used in production at Box, Capital One, IBM and Sling.

Analytics and machine learning

The final piece of the streaming platform puzzle is the ability to act on the data. For analytics use cases, Apache Spark is a distributed computing engine used for processing and analyzing large amounts of data. Spark can take advantage of the Apache Ignite in-memory computing platform to rapidly analyze the huge amounts of data being ingested via the streaming pipeline. Spark can also use Ignite as an online datastore, enabling Spark users to append data to their existing DataFrames or RDDs and rerun Spark jobs. Spark also makes it easy to write simple queries for unstructured data in a distributed computing environment. Spark is used in production at Amazon, Credit Karma, eBay, NTT Data and Yahoo!.
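To make the analytics step concrete, the pure-Python sketch below mimics the kind of filter-then-aggregate query Spark expresses over a DataFrame, finding the peak over-threshold temperature per device in a predictive maintenance scenario. The schema, device names and threshold are assumptions; real Spark code would use the DataFrame or RDD APIs on a cluster:

```python
from collections import defaultdict

# Hypothetical sensor readings (illustrative schema, not from the article).
records = [
    {"device": "pump-1", "temp_c": 71.0},
    {"device": "pump-1", "temp_c": 98.5},
    {"device": "pump-2", "temp_c": 64.2},
    {"device": "pump-2", "temp_c": 99.1},
]

# "filter": keep only readings above an alert threshold.
hot = [r for r in records if r["temp_c"] > 90.0]

# "groupBy(device).max(temp_c)": peak over-threshold temperature per device.
peak = defaultdict(float)
for r in hot:
    peak[r["device"]] = max(peak[r["device"]], r["temp_c"])
```

In Spark the same logic would run lazily and in parallel across partitions, with Ignite optionally serving as the in-memory datastore backing the DataFrame.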

For machine learning use cases, Apache Ignite includes integrated, fully distributed machine learning and deep learning libraries that have been optimized for massively parallel processing. This integration enables businesses to create continuous learning applications in which machine learning or deep learning algorithms run locally against the data residing in memory on each node of the in-memory computing cluster. Running the algorithms locally allows the models to be continuously updated as new data arrives on the nodes, even at petabyte scale.
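A minimal sketch of the local-update idea, not Ignite's ML API: each simulated node incrementally updates a trivial model (a running mean) against only the data arriving on that node, and a global model is formed by count-weighted averaging of the local models, so raw data never leaves its node:

```python
# Illustrative stand-in for local, incremental model updates on each node.
class LocalModel:
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental (online) update: old data never needs revisiting.
        self.n += 1
        self.mean += (x - self.mean) / self.n

nodes = [LocalModel(), LocalModel()]
streams = [[10.0, 12.0, 11.0], [20.0, 22.0]]  # data arriving on each node

for node, stream in zip(nodes, streams):
    for x in stream:
        node.update(x)

# Combine local models without moving any raw data between nodes.
total_n = sum(node.n for node in nodes)
global_mean = sum(node.mean * node.n for node in nodes) / total_n
```

The same shape applies to real models: local partial fits on in-memory partitions, then a cheap aggregation step, which is what makes continuous learning feasible at scale.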

All IoT Agenda network contributors are responsible for the content and accuracy of their posts. Opinions are of the writers and do not necessarily convey the thoughts of IoT Agenda.
