
How to run ML workloads with Apache Spark on Kubernetes

IT staff looking for an easier way to run and maintain ML workloads are increasingly turning to Apache Spark. Follow these steps to set up a Spark cluster on Kubernetes.

Pairing Spark with Kubernetes can deliver a wide range of benefits. To get Spark up and running on Kubernetes, IT teams need only a handful of easy-to-learn commands.

Spark doesn't have to run on Kubernetes. But in many use cases, pairing the two can simplify Spark deployment while running machine learning (ML) workloads efficiently in a distributed environment.

What is Apache Spark?

Apache Spark is an open source data processing platform designed for ML workloads. Spark's main features include the following:

  • The ability to process large volumes of data quickly, especially when the data is stored in memory.
  • Support for real-time processing of data streams.
  • Highly customizable data processing workflows.
  • Multiple deployment models, which means that Spark can run on top of a Hadoop cluster if desired or operate on its own.

Thanks to these features, especially its fast data processing capabilities, Spark has become the de facto open source tool for powering ML workloads that require large-scale data processing.

The benefits of running Spark on Kubernetes

Kubernetes is not required to run Spark. But choosing to run Spark on top of Kubernetes can provide several advantages:

  • The ability to move Spark applications easily between different Kubernetes clusters, which is a benefit if you don't want to be locked into a particular infrastructure platform.
  • Support for segmenting Spark applications from each other while still housing them all within a single Kubernetes cluster.
  • A unified approach to application deployment and management, since you can manage everything through Kubernetes.
  • The ability to use Kubernetes ResourceQuotas to manage the resources allocated to Spark, as sketched in the example below.
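
The sketch below assumes a hypothetical spark-jobs namespace where Spark pods would run; the namespace name and the limits are illustrative only and are not part of the steps that follow.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-quota
  namespace: spark-jobs        # hypothetical namespace for Spark pods
spec:
  hard:
    requests.cpu: "8"          # total CPU all Spark pods may request
    requests.memory: 32Gi      # total memory all Spark pods may request
    limits.cpu: "16"
    limits.memory: 64Gi
    pods: "20"                 # cap on the number of Spark pods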

Apache Spark gained native support for Kubernetes starting with Spark 2.3. Native support means that you can deploy and manage Spark applications just like any other Kubernetes application by using container images and pods. You don't need any special tools or extensions to make Spark compatible with Kubernetes.
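
As an illustration of what native mode looks like, a job can be handed straight to the Kubernetes API server with spark-submit. The sketch below uses a placeholder for the API server address, and the JAR path assumes the examples bundled in the apache/spark image; in practice, you also need a Kubernetes service account with permission to create pods.

./bin/spark-submit \
  --master k8s://https://<kubernetes-api-server>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=apache/spark:v3.3.0 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar

The steps below take the alternative approach of running a standalone Spark master and worker as ordinary Kubernetes Deployments.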

Steps for deploying Spark on Kubernetes

To deploy Spark on Kubernetes, start by creating a Deployment for the Spark master.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: spark-master
spec:
  replicas: 1
  selector:
    matchLabels:
      component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      containers:
        - name: spark-master
          image: apache/spark:v3.3.0
          command: ["/spark-master"]   # assumes the image provides a /spark-master start script
          ports:
            - containerPort: 6000      # Spark master port
            - containerPort: 8080      # Spark web UI

Save this file as "spark-master.yml."

Next, create a Service.

kind: Service
apiVersion: v1
metadata:
  name: spark-master
spec:
  ports:
    - name: ui
      port: 8080
      targetPort: 8080
    - name: spark
      port: 6000
      targetPort: 6000
  selector:
    component: spark-master

Save your Service configuration as "spark-master-service.yml."

Then, create the Deployment and Service in your Kubernetes cluster by running the following commands, adjusting the paths to match where you saved the two files.

kubectl create -f ./kubernetes/spark-master.yml
kubectl create -f ./kubernetes/spark-master-service.yml
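
To verify that everything was created, check the Deployment, the Service and the master pod:

kubectl get deployment spark-master
kubectl get service spark-master
kubectl get pods -l component=spark-master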

At this point, you have a Spark master running. Now you can create a Spark worker to complete your cluster.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: worker
spec:
  replicas: 1
  selector:
    matchLabels:
      component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
        - name: worker
          image: apache/spark:v3.3.0
          command: ["/worker"]         # assumes the image provides a /worker start script
          ports:
            - containerPort: 8081      # Spark worker web UI

Save this file as "spark-worker.yml" and deploy it with the following.

kubectl create -f ./kubernetes/spark-worker.yml
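
A quick check confirms that the worker pod is running alongside the master:

kubectl get pods -l component=spark-worker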

Now a basic Spark cluster is up and running on Kubernetes, and you can begin submitting Spark workloads to the master.
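
As a sketch of what a submission might look like, the following runs the bundled SparkPi example from inside the master pod. It assumes the apache/spark image keeps spark-submit and the examples JAR under /opt/spark, and that the master process is listening on port 6000, the port wired into the Service above. Replace <spark-master-pod> with the pod name from kubectl get pods.

kubectl exec -it <spark-master-pod> -- /opt/spark/bin/spark-submit \
  --master spark://spark-master:6000 \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar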

To move toward a production environment, you could deploy additional workers to scale up the cluster. You may also want to open your cluster to external resources by setting up a Service that exposes a public IP address or creating an ingress rule. Refer to the Spark documentation for additional details.
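
For instance, scaling the worker Deployment adds capacity without writing new manifests, and a LoadBalancer Service can expose the master's web UI on port 8080 if your cluster runs on a cloud provider that supports it. Treat these as starting points rather than production-ready settings.

# Run three workers instead of one
kubectl scale deployment worker --replicas=3

# Expose the master's web UI through a cloud load balancer
kubectl expose deployment spark-master --name=spark-master-ui --type=LoadBalancer --port=8080 --target-port=8080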
