JNT Visual - Fotolia
It's still early going for containerizing the big data implementation process. However, users and vendors alike are increasingly eying software containers and Kubernetes, a technology for orchestrating and managing them, as tools to help ease deployments of big data systems and applications.
Early adopters expect big data containers running in Kubernetes clusters to accelerate development and deployment work by enabling system builds and application code to be reused. Containers in Kubernetes big data environments should also make it easier to move systems and applications to new platforms, reallocate computing resources as workloads change and optimize the use of available IT infrastructure, advocates say.
The pace is picking up on big data technology vendors adding support for containers and Kubernetes to their product offerings. For example, at the Strata Data Conference in San Jose, Calif., this month, MapR Technologies Inc. said it has integrated a Kubernetes volume driver into its big data platform to provide persistent data storage for containerized applications tied to the container orchestration technology.
MapR previously supported the use of specialized Docker containers with built-in connectivity to the MapR Converged Data Platform, but the Kubernetes extension is "much more transparent and native to the environment," said Jack Norris, the Santa Clara, Calif., company's senior vice president of data and applications. He added that the persistent storage capability lets containers be used for stateful applications, a requirement for a typical big data implementation with Hadoop and related technologies.
Also, the version 2.3 update of the open source Apache Spark processing engine released in late February includes a native Kubernetes scheduler. The Spark on Kubernetes technology, which is being developed by contributors from Bloomberg, Google, Intel and several other companies, is still described as experimental in nature, but it enables Spark 2.3 workloads to be run in a Kubernetes cluster.
Not to be outdone, an upcoming 1.5 release of Apache Flink -- a stream processing rival to Spark -- will provide increased ties to both Kubernetes and the rival Apache Mesos technology, according to Fabian Hueske, a co-founder and software engineer at Flink vendor Data Artisans. Users can run the Berlin-based company's current Flink distribution on Kubernetes, "but it's not always straightforward to do that now," Hueske said at the Strata conference. "It will be much easier with the new release."
Big data containers achieve liftoff
JD.com Inc., an online retailer based in Beijing, is an early user of Spark on Kubernetes. The company has also containerized TensorFlow, Caffe and other machine learning and deep learning frameworks in a single Kubernetes-based architecture, which it calls Moonshot.
The container approach is designed to streamline and simplify big data implementation efforts in support of machine learning and other AI analytics applications that are being run in the new architecture, said Zhen Fan, a software development engineer at JD.com. "A major consideration was that we should support all of the AI workloads in one cluster so we can maximize our resource usage," Fan said during a conference session.
However, he added that the containers also make it possible to quickly deploy analytics systems on the company's web servers to take advantage of overnight processing downtime.
"In e-commerce, the [web servers] are quite busy until midnight," Fan said. "But from 12 to 6 a.m., they can be used to run some offline jobs."
JD.com began work on the AI architecture in mid-2017; the retailer currently has 300 nodes running production jobs in containers, and it plans to expand the node count to 1,000 in the near future, Fan said. The Spark on Kubernetes technology was installed in the third quarter of last year, initially to support applications run with Spark's stream processing module.
However, that part of the Kubernetes big data deployment is still a proof-of-concept project intended to test "if Spark on Kubernetes is ready for a production environment," said Wei Ting Chen, a senior software engineer at Intel, which is helping JD.com build the architecture.
Chen noted that some pieces of Spark have yet to be tied to Kubernetes, and he cited several other issues that need to be assessed. For example, JD.com and Intel are looking at whether using Kubernetes could cause performance bottlenecks when launching large numbers of containers, Chen said. Reliability is another concern, as more and more processing workloads are run through Spark on Kubernetes, he added.
Out on the edge with Kubernetes
Spark on Kubernetes is a bleeding-edge technology that's currently best suited to big data implementations in organizations that have sufficient "technical muscle," said Vinod Nair, director of product management at Pepperdata Inc., a vendor of performance management tools for big data systems that is involved in the Spark on Kubernetes development effort.
The Kubernetes scheduler is a preview feature in Spark 2.3 and likely won't be ready for general availability for another six to 12 months, according to Nair. "It's a fairly large undertaking, so I expect it will be some time before it's out in production," he said. "It's at about an alpha test state at this point."
Pepperdata plans to support Kubernetes-based containers for Spark and the Hadoop Distributed File System in some of its products, starting with Application Spotlight, a performance management portal for developers of big data applications that the Cupertino, Calif., company announced this month. With the recent release of Hadoop 3.0, the YARN resource manager built into Hadoop can also control Docker containers, "but Kubernetes seems to have much bigger ambitions to what it wants to do," Nair said.
Not everyone is sold on Kubernetes -- or K8s, as it's informally known. For example, BlueData Software Inc. uses a custom orchestrator to manage the Docker containers at the heart of its big-data-as-a-service platform. Tom Phelan, co-founder and chief architect at BlueData, said he still thinks the homegrown container orchestration tool has a technical edge on Kubernetes, particularly for stateful big data applications.
Kinnary Janglasenior software engineer, Pinterest
Phelan added, though, that the Santa Clara, Calif., vendor is working with Kubernetes in the lab with an eye on possible future adoption.
Pinterest Inc. is doing the same thing. The San Francisco company is moving to use Docker containers to speed up development and deployment of various machine learning applications that help drive its image bookmarking and social networking site under the covers, said Kinnary Jangla, a senior software engineer at Pinterest.
Jangla, who built a container-based setup for debugging machine learning models as a test case, said in a presentation at Strata that Pinterest is also testing a Kubernetes big data cluster. "We're trying to see if that is going to be useful to us as we migrate to production," she said. "But we're not there yet."