Ultra Basic Question on PySpark with Kubernetes

Time:12-01

After fighting with the lack of documentation and the wildly misleading information out there on PySpark with Kubernetes, I think I have boiled this down to one question: how do I get the driver pod that gets spun up to read my Python file (not a dependency, the actual file itself)? Here's the command I'm using:

kubectl run --namespace apache-spark apache-spark-client --rm --tty -i --restart='Never' \
--image docker.io/bitnami/spark:3.1.2-debian-10-r44 \
-- spark-submit --master spark://10.120.112.210:30077 \
test.py

Here's what I get back:

python3: can't open file '/opt/bitnami/spark/test.py': [Errno 2] No such file or directory

OK, so how do I get this Python file onto the driver pod? This vital piece of information seems to be completely missing from hundreds of articles on the subject. I have mounted volumes that the workers can see and tried using that as the path, but it still doesn't work. So I'm assuming the file has to be on the driver pod. But how? Every example just throws in the .py file without any mention of how it gets there.

CodePudding user response:

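You are not mounting any volume to the pod, so even if the file is present in the NFS mount, it won't be visible from within the pod. You must mount it. Since spark-submit uses client deploy mode by default, the driver runs inside this client pod, so this is the pod that has to see the file. The following command creates a pod but does not attach any volume to it.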

kubectl run --namespace apache-spark apache-spark-client --rm --tty -i --restart='Never' \
--image docker.io/bitnami/spark:3.1.2-debian-10-r44 \
-- spark-submit --master spark://10.120.112.210:30077 \
test.py

If you wish to use an NFS volume, you need to mount it in the pod through the right PVC or a hostPath pointing at the NFS mount. TL;DR: mount the volume.
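
As a rough sketch of what "mount the volume" could look like for the command above, the Python script below prints a Pod manifest that runs the same spark-submit arguments with a PVC attached. The claim name spark-nfs-pvc, the mount path /mnt/nfs, the volume name, and the helper name are all hypothetical; it assumes a PVC backed by the NFS share already exists in the apache-spark namespace and that test.py sits at the root of that share.

```python
import json


def spark_client_pod_with_pvc(claim_name, script_name, master_url,
                              mount_path="/mnt/nfs"):
    """Build a Pod manifest (as a dict) that mounts a PVC and submits a script from it."""
    volume_name = "nfs-code"  # hypothetical; must be identical in volumes and volumeMounts
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "apache-spark-client", "namespace": "apache-spark"},
        "spec": {
            "restartPolicy": "Never",  # mirrors --restart='Never'
            "containers": [{
                "name": "apache-spark-client",
                "image": "docker.io/bitnami/spark:3.1.2-debian-10-r44",
                # Like `kubectl run ... -- spark-submit ...`, this sets args
                # and leaves the image's own entrypoint in place.
                "args": ["spark-submit", "--master", master_url,
                         f"{mount_path}/{script_name}"],
                "volumeMounts": [{"name": volume_name,
                                  "mountPath": mount_path,
                                  "readOnly": True}],
            }],
            "volumes": [{"name": volume_name,
                         "persistentVolumeClaim": {"claimName": claim_name}}],
        },
    }


if __name__ == "__main__":
    pod = spark_client_pod_with_pvc("spark-nfs-pvc", "test.py",
                                    "spark://10.120.112.210:30077")
    print(json.dumps(pod, indent=2))
```

kubectl accepts JSON manifests, so the output could be piped into kubectl apply -f - instead of using kubectl run. The key point is that spark-submit is given the path inside the mount (/mnt/nfs/test.py here) rather than a bare test.py.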

Alternatively, if you wish to use a ConfigMap and volumes to make a local file available inside the pod, you can refer to this example. Here, I have created an info.log file locally on the server where I run kubectl commands.

// Create a test file on my workstation

echo "This file is written in my workstation, not inside the pod" > info.log

// create a config-map of the file:

kubectl create cm test-cm --from-file info.log
configmap/test-cm created

// mount the configmap as volume, notice the volumes and volumeMounts section:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: test-pod
  name: test-pod
spec:
  nodeName: k8s-master
  containers:
  - command:
    - sleep
    - infinity
    image: ubuntu
    name: test-pod
    resources: {}
    volumeMounts:
     - name: my-vol
       mountPath: /tmp
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:
  - name: my-vol
    configMap:
      name: test-cm

status: {}

// Test it: using the volume, I can access the info.log file from within the pod.

kubectl exec -it test-pod -- bash
root@test-pod:/# cd /tmp/
root@test-pod:/tmp# ls
info.log
root@test-pod:/tmp# cat info.log
This file is written in my workstation, not inside the pod
root@test-pod:/tmp#
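
The same ConfigMap pattern can be applied to the test.py from the question. The sketch below is one possible way to do it, not a definitive recipe: it builds a ConfigMap holding the script plus a Pod that mounts it and runs spark-submit against the mounted path, and prints both as a single v1 List. The ConfigMap name spark-app-code, the volume name, the mount path /opt/app, and the inline sample script are assumptions made for illustration.

```python
import json


def configmap_and_pod(script_name, script_text, master_url,
                      namespace="apache-spark", mount_path="/opt/app"):
    """Return a v1 List with a ConfigMap holding the script and a Pod that mounts it."""
    cm_name = "spark-app-code"  # hypothetical ConfigMap name
    vol_name = "app-code"       # hypothetical volume name
    configmap = {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": cm_name, "namespace": namespace},
        # One key per file; each key becomes a file name inside the mount.
        "data": {script_name: script_text},
    }
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "apache-spark-client", "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "apache-spark-client",
                "image": "docker.io/bitnami/spark:3.1.2-debian-10-r44",
                # args only, so the image's entrypoint is kept
                "args": ["spark-submit", "--master", master_url,
                         f"{mount_path}/{script_name}"],
                "volumeMounts": [{"name": vol_name, "mountPath": mount_path}],
            }],
            "volumes": [{"name": vol_name, "configMap": {"name": cm_name}}],
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [configmap, pod]}


if __name__ == "__main__":
    # Placeholder script body; in practice this would be the contents of test.py.
    sample = 'print("hello from the driver")\n'
    manifest = configmap_and_pod("test.py", sample, "spark://10.120.112.210:30077")
    print(json.dumps(manifest, indent=2))
```

Piping the output into kubectl apply -f - would create both objects in one step. Keep in mind that a ConfigMap is limited to 1 MiB of data, so this suits a single script but not large applications or data files; for those, a PVC-backed volume is the better fit.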