
Deploy Apache Spark jobs to Kubernetes using Tekton

Introduction

Apache Spark is a unified analytics engine for large-scale data processing, but getting a Spark job running on Kubernetes can be complicated: you need to compile your application and all of its dependencies, then build a container image that can launch and monitor your job on the cluster. That's a number of hoops to jump through before you can run your code. This tutorial simplifies the process by showing how you can use Tekton Pipelines to automate the deployment of a Spark job on Kubernetes.

Prerequisites

To follow this tutorial, you need a Kubernetes cluster with the kubectl CLI configured to access it, push access to a container registry such as Docker Hub, and the Kubernetes Operator for Apache Spark installed on the cluster, since the job is deployed as a SparkApplication custom resource.

Estimated Time

This tutorial should take you approximately 15 to 20 minutes to complete.

Steps

Install Tekton Pipelines

Install Tekton Pipelines on your cluster by running the following command:

kubectl apply --filename https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml

Note: At the time of writing, this command installs release v0.22.0. You can also install a different version by following the instructions to install a specific release.
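
To confirm that the installation succeeded, check that the Tekton controller and webhook pods are running. The exact pod names vary by release, but they all live in the tekton-pipelines namespace:

kubectl get pods --namespace tekton-pipelines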

Install Tekton CLI and Dashboard

Although it isn't required, you can install tools that make it easier to view the status of your Tekton jobs. For a command line option, install the Tekton CLI by following the instructions for your operating system.

For a graphical option, you can install the Tekton Dashboard by running the following command:

kubectl apply --filename https://github.com/tektoncd/dashboard/releases/latest/download/tekton-dashboard-release.yaml

Note: At the time of writing, this command installs release v0.15.0. You can also install a different version by following the instructions to install a specific release.
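
If you installed the CLI, a quick sanity check is to print its version. To reach the dashboard, one option is to forward a local port to the dashboard service (this assumes the default installation in the tekton-pipelines namespace; adjust the namespace and service name if your installation differs):

tkn version
kubectl port-forward -n tekton-pipelines service/tekton-dashboard 9097:9097

You can then open http://localhost:9097 in your browser.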

Install Tasks to the cluster

Now that you have installed Tekton, you can go about setting up your build pipeline. Your build process looks like this:

  1. Clone your git repository.
  2. Build your container image.
  3. Push your container to a container repository.
  4. Deploy your application to the cluster.

Tekton provides a catalog of common Tasks that you can use instead of having to reinvent the wheel. For this tutorial, there are preexisting Tasks for each step of the process.

To clone your repository, use the git-clone Task:

kubectl apply -f https://raw.githubusercontent.com/tektoncd/catalog/master/task/git-clone/0.2/git-clone.yaml

The build and push steps are combined into a single kaniko Task:

kubectl apply -f https://raw.githubusercontent.com/tektoncd/catalog/master/task/kaniko/0.1/kaniko.yaml

To deploy your image, use the kubectl-deploy-pod Task:

kubectl apply -f https://raw.githubusercontent.com/tektoncd/catalog/master/task/kubectl-deploy-pod/0.1/kubectl-deploy-pod.yaml
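
At this point, the three Tasks should be available in your namespace. You can confirm this with kubectl or, if you installed the Tekton CLI, with tkn:

kubectl get tasks.tekton.dev
tkn task list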

That’s it. Now it’s time to combine these Tasks into a pipeline.

Create a pipeline to run Tasks

Next, you create a pipeline to run your Tasks. The pipeline runs the git-clone Task, followed by the kaniko Task, and then finally the kubectl-deploy-pod Task.

Pipeline workflow

Save the following code into a file called pipeline.yaml:

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: clone-build-push
spec:
  workspaces:
  - name: shared-data
  params:
  - name: url
    description: git url to clone (required)
  - name: IMAGE
    description: the name (reference) of the image to build (required)
  - name: DOCKERFILE
    description: the path to the Dockerfile to execute (default ./Dockerfile)
  - name: CONTEXT
    description: the build context used by Kaniko (default ./)
  - name: action
    description: the action to perform to the resource (required)
  - name: manifest
    description: the content of the resource to deploy (required)
  - name: success-condition
    description: condition that indicates deploy was successful (optional)
  - name: failure-condition
    description: condition that indicates deploy failed (optional)
  - name: output
    description: values to extract from the completed resource and write to /tekton/results/$(name)
  tasks:
  - name: clone
    taskRef:
      name: git-clone
    workspaces:
    - name: output
      workspace: shared-data
    params:
      - name: url
        value: "$(params.url)"
  - name: build-push
    taskRef:
      name: kaniko
    runAfter:
      - clone
    params:
    - name: IMAGE
      value: "$(params.IMAGE)"
    - name: DOCKERFILE
      value: "$(params.DOCKERFILE)"
    - name: CONTEXT
      value: "$(params.CONTEXT)"
    workspaces:
    - name: source
      workspace: shared-data
  - name: deploy
    taskRef:
      name: kubectl-deploy-pod
    runAfter:
      - build-push
    params:
    - name: action
      value: "$(params.action)"
    - name: manifest
      value: "$(params.manifest)"
    - name: success-condition
      value: "$(params.success-condition)"
    - name: failure-condition
      value: "$(params.failure-condition)"
    - name: output
      value: "$(params.output)"

Apply pipeline.yaml to the cluster by running the following:

kubectl apply -f pipeline.yaml
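
Before moving on, you can optionally confirm that the pipeline was created with either of the following commands:

kubectl get pipeline clone-build-push
tkn pipeline list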

Now that your pipeline is ready, you just need to configure the credentials required to run it.

Deploy a secret for a container registry

The last setup step is to configure credentials for authenticating to your container registry. You do this by creating a registry secret that the pipeline can use to push your image.

To create this secret, fill in the appropriate values and run the following command:

kubectl create secret docker-registry regcred \
    --docker-server=<server> \
    --docker-username=<username> \
    --docker-password=<password> \
    --docker-email=<email>

If you’re using Docker Hub as your registry, you can omit the --docker-server option. For other registries, you might need to add the tekton.dev/docker-0 annotation to your secret by running the following:

kubectl annotate secret regcred tekton.dev/docker-0=<server>

Finally, add this secret to your default service account by running the following:

kubectl patch sa default --type='json' -p='[{"op": "add", "path": "/secrets/1", "value": {"name": "regcred" } }]'
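
This patch appends regcred at index 1 of the service account's secrets list, which assumes the default service account already has one entry there (typically its generated token). If the patch fails on your cluster, you can instead run kubectl edit sa default and add regcred under secrets by hand. Either way, you can verify the result by inspecting the service account:

kubectl get serviceaccount default -o yaml

The output should list regcred in its secrets section.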

Now that the setup is complete, you can run your job.

Create a PipelineRun to run the pipeline

To run your pipeline, you need to create a PipelineRun. The PipelineRun supplies the runtime details, such as parameter values and the workspace binding, to your pipeline and its Tasks at execution time.

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: build-spark-with-tekton
spec:
  pipelineRef:
    name: clone-build-push
  workspaces:
  - name: shared-data
    volumeClaimTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
  params:
    - name: url
      value: "https://github.com/IBM/spark-tekton"
    - name: IMAGE
      value: "psschwei/containerized-app:latest"
    - name: DOCKERFILE
      value: "s2i.Dockerfile"
    - name: CONTEXT
      value: "."
    - name: action
      value: "create"
    - name: output
      value: |
        - name: job-name
          valueFrom: '{.metadata.name}'
        - name: job-namespace
          valueFrom: '{.metadata.namespace}'
    - name: success-condition
      value: "status.applicationState.state = COMPLETED"
    - name: failure-condition
      value: "status.applicationState.state = FAILED"
    - name: manifest
      value: |
        apiVersion: sparkoperator.k8s.io/v1beta2
        kind: SparkApplication
        metadata:
          name: spark-pi
          labels:
            job: spark-test
        spec:
          sparkVersion: 2.4.5
          type: Scala
          mode: cluster
          image: psschwei/containerized-app:latest
          imagePullPolicy: Always
          mainClass: SparkPi
          mainApplicationFile: local:///opt/spark/target/scala-2.11/kubernetes-assembly-1.0.0.jar
          driver:
            memory: 1g
          executor:
            instances: 2
            memory: 2g

Note that you need to update the previous YAML to use your own container registry namespace.

If you’re using the sample GitHub repo, note that you are using s2i.Dockerfile for this build, not Dockerfile. The s2i Dockerfile runs a multistage build that creates a container image from source code, whereas the regular Dockerfile in this repo assumes that you have already created your Scala binary.

Save the preceding code into a file called pipelinerun.yaml and apply it by running the following command:

kubectl apply -f pipelinerun.yaml

That’s it. Your job is now running.

Monitor the status of the PipelineRun

To monitor the status of your job in real time, you can run the following commands to view the logs (note that the pod names include randomly generated suffixes, so substitute the names from your own cluster):

$ kubectl get pods
NAME                                                 READY   STATUS      RESTARTS   AGE
affinity-assistant-0e390a2f80-0                      1/1     Running     0          117s
build-spark-with-tekton-build-push-x5fgk-pod-7g82k   3/3     Running     0          18s
build-spark-with-tekton-clone-65php-pod-mhs2n        0/1     Completed   0          117s

$ kubectl logs build-spark-with-tekton-build-push-x5fgk-pod-7g82k
error: a container name must be specified for pod build-spark-with-tekton-build-push-x5fgk-pod-7g82k, choose one of: [step-build-and-push step-write-digest step-digest-to-results] or one of the init containers: [place-scripts working-dir-initializer place-tools]

$ kubectl logs build-spark-with-tekton-build-push-x5fgk-pod-7g82k step-build-and-push | tail -n 10
INFO[0028] args: [-c mkdir /wkdir]                      
INFO[0028] Running: [/bin/sh -c mkdir /wkdir]           
INFO[0028] Taking snapshot of full filesystem...        
INFO[0029] COPY . /wkdir                                
INFO[0030] Taking snapshot of files...                  
INFO[0030] RUN cd /wkdir && sbt assembly                
INFO[0030] cmd: /bin/sh                                 
INFO[0030] args: [-c cd /wkdir && sbt assembly]         
INFO[0030] Running: [/bin/sh -c cd /wkdir && sbt assembly]
[info] [launcher] getting org.scala-sbt sbt 1.4.3  (this may take some time)...

If you installed the Tekton CLI tool or the dashboard earlier, you can also use them to monitor your job.

$ tkn pr describe build-spark-with-tekton
Name:              build-spark-with-tekton
Namespace:         default
Pipeline Ref:      clone-build-push
Service Account:   default
Timeout:           1h0m0s
Labels:
 tekton.dev/pipeline=clone-build-push

🌡️  Status

STARTED         DURATION    STATUS
9 minutes ago   6 minutes   Succeeded

📦 Resources

 No resources

⚓ Params

 NAME           VALUE
 ∙ url          https://github.com/psschwei/containerized-app.git
 ∙ IMAGE        psschwei/containerized-app:latest
 ∙ DOCKERFILE   s2i.Dockerfile
 ∙ CONTEXT      .
 ∙ action       create
 ∙ output       - name: job-name
  valueFrom: '{.metadata.name}'
- name: job-namespace
  valueFrom: '{.metadata.namespace}'

 ∙ success-condition   status.applicationState.state = COMPLETED
 ∙ failure-condition   status.applicationState.state = FAILED
 ∙ manifest            apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  labels:
    job: spark-test
spec:
  sparkVersion: 2.4.5
  type: Scala
  mode: cluster
  image: psschwei/containerized-app:latest
  imagePullPolicy: Always
  mainClass: SparkPi
  mainApplicationFile: local:///opt/spark/target/scala-2.11/kubernetes-assembly-1.0.0.jar
  driver:
    memory: 1g
  executor:
    instances: 2
    memory: 2g


πŸ“ Results

 No results

📂 Workspaces

 NAME            SUB PATH   WORKSPACE BINDING
 ∙ shared-data   ---        VolumeClaimTemplate

🗂  Taskruns

 NAME                                         TASK NAME    STARTED         DURATION    STATUS
 ∙ build-spark-with-tekton-deploy-nbsv6       deploy       4 minutes ago   1 minute    Succeeded
 ∙ build-spark-with-tekton-build-push-75xq2   build-push   7 minutes ago   3 minutes   Succeeded
 ∙ build-spark-with-tekton-clone-n62f9        clone        9 minutes ago   1 minute    Succeeded

Dashboard

All that’s left to do now is wait for your PipelineRun to finish.
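
Once the deploy Task finishes, the Spark Operator takes over and runs the job itself. Assuming you kept the sample manifest (a SparkApplication named spark-pi) and that the operator follows its usual convention of naming the driver pod <application-name>-driver, you can check on the Spark job directly:

kubectl get sparkapplications
kubectl describe sparkapplication spark-pi
kubectl logs spark-pi-driver | tail -n 20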

Summary

Congratulations! You’ve successfully built a Tekton Pipeline that you can use to deploy your Apache Spark job. As a next step, look into running your job on a schedule using a ScheduledSparkApplication. And after that, use Tekton Triggers to rebuild your application automatically with every code change you push to GitHub.