Kubernetes container orchestration gets big data star turn

Jack Vaughan

Listen to this podcast

The new thing in big data is Kubernetes container orchestration. While it's still early, there are signs of activity, which are cited in this edition of the Talking Data podcast.

Podcast

The foundations of big data continue to shift, driven in great part by AI and machine learning applications. The push to work in real time, and to quickly place AI tools in the hands of data scientists and business analysts, has created interest in software containers as a more flexible mechanism for deploying big data systems and applications.

Now, Kubernetes container orchestration is emerging to provide an underpinning for the new container-based workloads. It has stepped into the big data spotlight -- one formerly reserved for data frameworks like Hadoop and Spark.

These frameworks continue to play an important role in big data, but in more of a supporting role, as discussed in this podcast review of the 2018 Strata Data Conference in San Jose, Calif. That's particularly true in the case of Hadoop, the featured topic in only a couple of sessions at the conference, which until last year was called Strata + Hadoop World.

"It's not that people are turning their backs on Hadoop," said Craig Stedman, SearchDataManagement's senior executive editor and a podcast participant. "But it is becoming part of the woodwork."

The attention of IT teams is shifting more toward the actual applications and how they can get more immediate value out of data science, AI and machine learning, he indicated. Maximizing resources is a must, and this is where Kubernetes-based containers are seen as potential helpers for teams looking to swap workloads in and out and maximize the use of computing resources in fast-moving environments.

Kubernetes connections for Spark and Flink, a rival stream processing engine, are increasingly being more closely watched.

At Strata, Stedman said, the deployment of a prototype Kubernetes-Spark combination and several other machine learning frameworks in a Kubernetes-based architecture was seen partly as a way to nimbly shift workloads between CPUs and GPUs, the latter processor type playing a growing role in training the neural network's underlying machine learning and deep learning applications.

The deployment was the work of JD.com Inc., a Beijing-based online retailer and Strata presenter. It is worth emphasizing the early adopter status of such implementations, however. While JD.com is running production applications in the container architecture, Stedman reported that it's still studying performance and reliability issues around the new coupling of Spark and Kubernetes that's included in Apache Spark 2.3 as an experimental technology.

Overall, in fact, there is much learning ahead for Kubernetes container orchestration when it comes to big data. That is because containers tend to be ephemeral or stateless, while big data is traditionally stateful, providing data persistence.

Bridging the two takes on state is the goal of a Kubernetes volume driver that MapR Technologies announced at Strata, which is integrated into the company's big data platform. As such, it addresses one of the obstacles Kubernetes container orchestration faces in big data applications.

Stedman said the march to stateful applications on Kubernetes continued to advance after the conference, as Data Artisans launched its dA Platform for what it described as stateful stream processing with Flink. The development and runtime environment is intended for use with real-time analytics, machine learning and other applications that can be deployed on Kubernetes in order to provide dynamic allocation of computing resources.

Listen to this podcast to learn more about the arrival of containers in the world of Hadoop and Spark and the overall evolution of big data as seen at the Strata event.