Introduction
Apache Spark is a unified analytics engine for large-scale data processing, but getting a Spark job running on Kubernetes can be complicated: you need to compile your application with all of its dependencies, build a runnable container image, and then launch and monitor your job on the cluster. That's a lot of hoops to jump through before you can run your code. This tutorial simplifies the process by showing how you can use Tekton Pipelines to automate building and deploying a Spark job on Kubernetes.
Prerequisites
- A Kubernetes cluster running version 1.16 or later
- A Spark application in GitHub (example repo)
- A Dockerfile in the GitHub repo that does a source-to-image style build
- A container registry (such as Docker Hub or IBM Cloud Container Registry)
- The Kubernetes Operator for Apache Spark installed on your cluster
- The default service account in your namespace has the RBAC permissions necessary to run SparkApplications
- A SparkApplication resource definition for your job
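If the default service account does not already have those permissions, a Role and RoleBinding along these lines can grant them. This is an illustrative sketch, not a definitive policy: the spark-app-manager name is made up, and the exact verbs you need may vary in your environment.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-app-manager   # hypothetical name
rules:
  - apiGroups: ["sparkoperator.k8s.io"]
    resources: ["sparkapplications"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-app-manager-binding
subjects:
  - kind: ServiceAccount
    name: default
roleRef:
  kind: Role
  name: spark-app-manager
  apiGroup: rbac.authorization.k8s.io
```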
Estimated Time
This tutorial should take you approximately 15 to 20 minutes to complete.
Steps
Install Tekton Pipelines
Install Tekton Pipelines on your cluster by running the following command:
kubectl apply --filename https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml
Note: At the time of writing, this command installs release v0.22.0. You can also install a different version by following the instructions to install a specific release.
Install Tekton CLI and Dashboard
While not necessary, you can optionally install tools to make it easier to view the status of your Tekton jobs. If you want a command line-based option, install the Tekton CLI by following the operating system-appropriate instructions.
For a graphical option, you can install the Tekton Dashboard by running the following command:
kubectl apply --filename https://github.com/tektoncd/dashboard/releases/latest/download/tekton-dashboard-release.yaml
Note: At the time of writing, this command installs release v0.15.0. You can also install a different version by following the instructions to install a specific release.
Install Tasks to the cluster
Now that you have installed Tekton, you can go about setting up your build pipeline. Your build process looks like this:
- Clone your git repository.
- Build your container image.
- Push your container to a container repository.
- Deploy your application to the cluster.
Tekton provides a catalog of common Tasks that you can use instead of having to reinvent the wheel. For this tutorial, there are preexisting Tasks for each step of the process.
To clone your repository, use the git-clone Task:
kubectl apply -f https://raw.githubusercontent.com/tektoncd/catalog/master/task/git-clone/0.2/git-clone.yaml
The build and push steps are combined into a single kaniko Task:
kubectl apply -f https://raw.githubusercontent.com/tektoncd/catalog/master/task/kaniko/0.1/kaniko.yaml
To deploy your image, use the kubectl-deploy-pod Task:
kubectl apply -f https://raw.githubusercontent.com/tektoncd/catalog/master/task/kubectl-deploy-pod/0.1/kubectl-deploy-pod.yaml
That’s it. Now it’s time to combine these Tasks into a pipeline.
Create a pipeline to run Tasks
Next, you create a pipeline to run your Tasks. The pipeline runs the git-clone Task, followed by the kaniko Task, and then finally the kubectl-deploy-pod Task.
Save the following code into a file called pipeline.yaml:
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: clone-build-push
spec:
  workspaces:
    - name: shared-data
  params:
    - name: url
      description: git url to clone (required)
    - name: IMAGE
      description: the name (reference) of the image to build (required)
    - name: DOCKERFILE
      description: the path to the Dockerfile to execute (default ./Dockerfile)
    - name: CONTEXT
      description: the build context used by Kaniko (default ./)
    - name: action
      description: the action to perform on the resource (required)
    - name: manifest
      description: the content of the resource to deploy (required)
    - name: success-condition
      description: condition that indicates deploy was successful (optional)
    - name: failure-condition
      description: condition that indicates deploy failed (optional)
    - name: output
      description: values from the completed resource to extract and write to /tekton/results/$(name)
  tasks:
    - name: clone
      taskRef:
        name: git-clone
      workspaces:
        - name: output
          workspace: shared-data
      params:
        - name: url
          value: "$(params.url)"
    - name: build-push
      taskRef:
        name: kaniko
      runAfter:
        - clone
      params:
        - name: IMAGE
          value: "$(params.IMAGE)"
        - name: DOCKERFILE
          value: "$(params.DOCKERFILE)"
        - name: CONTEXT
          value: "$(params.CONTEXT)"
      workspaces:
        - name: source
          workspace: shared-data
    - name: deploy
      taskRef:
        name: kubectl-deploy-pod
      runAfter:
        - build-push
      params:
        - name: action
          value: "$(params.action)"
        - name: manifest
          value: "$(params.manifest)"
        - name: success-condition
          value: "$(params.success-condition)"
        - name: failure-condition
          value: "$(params.failure-condition)"
        - name: output
          value: "$(params.output)"
Apply pipeline.yaml to the cluster by running the following:
kubectl apply -f pipeline.yaml
Now that your pipeline is ready, you just need to configure the credentials you need to run it.
Deploy a secret for a container registry
Next, set up the credentials to authenticate to your container registry. You do this by creating a registry secret so that the pipeline can push your image to the registry.
To create this secret, fill in the appropriate values and run the following command:
kubectl create secret docker-registry regcred \
--docker-server=<server> \
--docker-username=<username> \
--docker-password=<password> \
--docker-email=<email>
If you’re using Docker Hub as your registry, you can omit the docker-server option. If you are using another registry, you might need to add the tekton.dev/docker-0 annotation to your secret by running the following:
kubectl annotate secret regcred tekton.dev/docker-0=<server>
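For reference, after annotating, the secret looks roughly like this. This is a sketch with the credential data elided:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: regcred
  annotations:
    tekton.dev/docker-0: <server>
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded credentials>
```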
Finally, add this secret to your default service account by running the following:
kubectl patch sa default --type='json' -p='[{"op": "add", "path": "/secrets/1", "value": {"name": "regcred" } }]'
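Afterward, the default service account should list regcred alongside its existing secrets, roughly like this (the token secret name shown here is hypothetical):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
secrets:
  - name: default-token-abc12   # hypothetical existing token secret
  - name: regcred
```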
Now that the setup is complete, you can run your job.
Create a PipelineRun to run the pipeline
To run your pipeline, you need to create a PipelineRun. The PipelineRun supplies the runtime details, such as parameter values and workspace volumes, to your pipeline and its Tasks at execution time.
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: build-spark-with-tekton
spec:
  pipelineRef:
    name: clone-build-push
  workspaces:
    - name: shared-data
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1Gi
  params:
    - name: url
      value: "https://github.com/IBM/spark-tekton"
    - name: IMAGE
      value: "psschwei/containerized-app:latest"
    - name: DOCKERFILE
      value: "s2i.Dockerfile"
    - name: CONTEXT
      value: "."
    - name: action
      value: "create"
    - name: output
      value: |
        - name: job-name
          valueFrom: '{.metadata.name}'
        - name: job-namespace
          valueFrom: '{.metadata.namespace}'
    - name: success-condition
      value: "status.applicationState.state = COMPLETED"
    - name: failure-condition
      value: "status.applicationState.state = FAILED"
    - name: manifest
      value: |
        apiVersion: sparkoperator.k8s.io/v1beta2
        kind: SparkApplication
        metadata:
          name: spark-pi
          labels:
            job: spark-test
        spec:
          sparkVersion: 2.4.5
          type: Scala
          mode: cluster
          image: psschwei/containerized-app:latest
          imagePullPolicy: Always
          mainClass: SparkPi
          mainApplicationFile: local:///opt/spark/target/scala-2.11/kubernetes-assembly-1.0.0.jar
          driver:
            memory: 1g
          executor:
            instances: 2
            memory: 2g
Note that you need to update the previous YAML to use your own container registry namespace.
If you’re using the sample GitHub repo, note that this build uses s2i.Dockerfile, not Dockerfile. The s2i Dockerfile runs a multistage build that creates a container image directly from source code, whereas the regular Dockerfile in the repo assumes that you have already created your Scala binary.
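Conceptually, a source-to-image style multistage Dockerfile looks something like the sketch below. This is not the exact file from the sample repo: the base image tags are assumptions, though the build steps mirror the ones visible in the Kaniko logs later in this tutorial.

```dockerfile
# --- Stage 1: compile the Scala application from source ---
# Base image is an assumption; any image with a JDK and sbt works.
FROM hseeberger/scala-sbt:8u265_1.4.3_2.11.12 AS builder
RUN mkdir /wkdir
COPY . /wkdir
RUN cd /wkdir && sbt assembly

# --- Stage 2: copy the assembled jar into a Spark runtime image ---
# The Spark base image tag is also an assumption.
FROM gcr.io/spark-operator/spark:v2.4.5
COPY --from=builder /wkdir/target/scala-2.11/kubernetes-assembly-1.0.0.jar \
     /opt/spark/target/scala-2.11/kubernetes-assembly-1.0.0.jar
```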
Save the previous code into a file called pipelinerun.yaml and apply it by running the following command:
kubectl apply -f pipelinerun.yaml
That’s it. Your job is now running.
Monitor the status of PipelineRun
To monitor the status of your job in real time, you can run the following command to view the logs (note that you need to add the appropriate hash to the name of your pod):
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
affinity-assistant-0e390a2f80-0 1/1 Running 0 117s
build-spark-with-tekton-build-push-x5fgk-pod-7g82k 3/3 Running 0 18s
build-spark-with-tekton-clone-65php-pod-mhs2n 0/1 Completed 0 117s
$ kubectl logs build-spark-with-tekton-build-push-x5fgk-pod-7g82k
error: a container name must be specified for pod build-spark-with-tekton-build-push-x5fgk-pod-7g82k, choose one of: [step-build-and-push step-write-digest step-digest-to-results] or one of the init containers: [place-scripts working-dir-initializer place-tools]
$ kubectl logs build-spark-with-tekton-build-push-x5fgk-pod-7g82k step-build-and-push | tail -n 10
INFO[0028] args: [-c mkdir /wkdir]
INFO[0028] Running: [/bin/sh -c mkdir /wkdir]
INFO[0028] Taking snapshot of full filesystem...
INFO[0029] COPY . /wkdir
INFO[0030] Taking snapshot of files...
INFO[0030] RUN cd /wkdir && sbt assembly
INFO[0030] cmd: /bin/sh
INFO[0030] args: [-c cd /wkdir && sbt assembly]
INFO[0030] Running: [/bin/sh -c cd /wkdir && sbt assembly]
[info] [launcher] getting org.scala-sbt sbt 1.4.3 (this may take some time)...
If you installed the Tekton CLI tool or the dashboard earlier, you can also use them to monitor your job.
$ tkn pr describe build-spark-with-tekton
Name: build-spark-with-tekton
Namespace: default
Pipeline Ref: clone-build-push
Service Account: default
Timeout: 1h0m0s
Labels:
tekton.dev/pipeline=clone-build-push
🌡️ Status
STARTED DURATION STATUS
9 minutes ago 6 minutes Succeeded
📦 Resources
No resources
⚓ Params
NAME VALUE
∙ url https://github.com/psschwei/containerized-app.git
∙ IMAGE psschwei/containerized-app:latest
∙ DOCKERFILE s2i.Dockerfile
∙ CONTEXT .
∙ action create
∙ output - name: job-name
valueFrom: '{.metadata.name}'
- name: job-namespace
valueFrom: '{.metadata.namespace}'
∙ success-condition status.applicationState.state = COMPLETED
∙ failure-condition status.applicationState.state = FAILED
∙ manifest apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
name: spark-pi
labels:
job: spark-test
spec:
sparkVersion: 2.4.5
type: Scala
mode: cluster
image: psschwei/containerized-app:latest
imagePullPolicy: Always
mainClass: SparkPi
mainApplicationFile: local:///opt/spark/target/scala-2.11/kubernetes-assembly-1.0.0.jar
driver:
memory: 1g
executor:
instances: 2
memory: 2g
📝 Results
No results
📂 Workspaces
NAME SUB PATH WORKSPACE BINDING
∙ shared-data --- VolumeClaimTemplate
🗂 Taskruns
NAME TASK NAME STARTED DURATION STATUS
∙ build-spark-with-tekton-deploy-nbsv6 deploy 4 minutes ago 1 minute Succeeded
∙ build-spark-with-tekton-build-push-75xq2 build-push 7 minutes ago 3 minutes Succeeded
∙ build-spark-with-tekton-clone-n62f9 clone 9 minutes ago 1 minute Succeeded
All that’s left to do now is wait for your PipelineRun to finish.
Summary
Congratulations! You’ve successfully built a Tekton Pipeline that you can use to deploy your Apache Spark job. As a next step, look into running your job on a schedule using a ScheduledSparkApplication. And after that, use Tekton Triggers to rebuild your application automatically with every code change you push to GitHub.
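To give a taste of that next step, a ScheduledSparkApplication wraps the same SparkApplication spec in a cron schedule. The following sketch reuses the spec from this tutorial; the schedule and concurrencyPolicy values are assumptions you would tune for your own job:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled
spec:
  schedule: "@every 1h"      # assumed cadence; any cron expression works
  concurrencyPolicy: Allow
  template:
    sparkVersion: 2.4.5
    type: Scala
    mode: cluster
    image: psschwei/containerized-app:latest
    imagePullPolicy: Always
    mainClass: SparkPi
    mainApplicationFile: local:///opt/spark/target/scala-2.11/kubernetes-assembly-1.0.0.jar
    driver:
      memory: 1g
    executor:
      instances: 2
      memory: 2g
```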