After fighting with the lack of documentation and the wildly misleading information out there on PySpark with Kubernetes, I think I have boiled this down to one question: how do I get the driver pod that gets spun up to read my Python file (not a dependency, the actual file itself)? Here's the command I'm using:
kubectl run --namespace apache-spark apache-spark-client --rm --tty -i --restart='Never' \
--image docker.io/bitnami/spark:3.1.2-debian-10-r44 \
-- spark-submit --master spark://10.120.112.210:30077 \
test.py
Here's what I get back:
python3: can't open file '/opt/bitnami/spark/test.py': [Errno 2] No such file or directory
OK, so how do I get this python file onto the driver pod? This vital piece of information seems to be completely missing from hundreds of articles on the subject. I have mounted volumes that the workers can see and tried that as the path. Still doesn't work. So I'm assuming it has to be on the driver pod. But how? Every example just throws in the .py file without any mention of how it gets there.
User response:
You are not mounting any volume to the pod, so even if the file is present in the NFS mount, it won't be visible from within the pod. You must mount it. In the following command, you are creating a pod but not attaching any volume to it.
kubectl run --namespace apache-spark apache-spark-client --rm --tty -i --restart='Never' \
--image docker.io/bitnami/spark:3.1.2-debian-10-r44 \
-- spark-submit --master spark://10.120.112.210:30077 \
test.py
If you wish to use an NFS volume, you need to use the right PVC or hostPath to the NFS mount. TL;DR: mount the volume.
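One way to attach a volume without leaving `kubectl run` is its `--overrides` flag, which takes inline JSON that is merged into the generated pod. Below is a minimal Python sketch that assembles that JSON and prints the resulting command. The PVC name `spark-nfs-pvc`, the volume name `app-vol`, and the mount path `/mnt/spark-apps` are assumptions for illustration, not names from the question. Because the override is assumed to replace the whole `containers` list, the sketch repeats the image, the `spark-submit` arguments, and the stdin/tty settings there instead of passing them after `--`.

```python
import json
import shlex

IMAGE = "docker.io/bitnami/spark:3.1.2-debian-10-r44"
POD_NAME = "apache-spark-client"


def build_overrides(pod_name, image, claim_name, mount_path, app_args):
    """Return the JSON string to pass to `kubectl run --overrides`."""
    spec = {
        "spec": {
            "containers": [
                {
                    # kubectl run names the container after the pod
                    "name": pod_name,
                    "image": image,
                    "args": app_args,
                    "stdin": True,
                    "tty": True,
                    "volumeMounts": [
                        {"name": "app-vol", "mountPath": mount_path}
                    ],
                }
            ],
            "volumes": [
                {
                    "name": "app-vol",
                    # hypothetical PVC that is bound to the NFS export
                    "persistentVolumeClaim": {"claimName": claim_name},
                }
            ],
        }
    }
    return json.dumps(spec)


overrides = build_overrides(
    pod_name=POD_NAME,
    image=IMAGE,
    claim_name="spark-nfs-pvc",      # assumption: your PVC name
    mount_path="/mnt/spark-apps",    # assumption: where test.py will appear
    app_args=[
        "spark-submit",
        "--master", "spark://10.120.112.210:30077",
        "/mnt/spark-apps/test.py",
    ],
)

command = [
    "kubectl", "run", "--namespace", "apache-spark", POD_NAME,
    "--rm", "--tty", "-i", "--restart=Never",
    "--image", IMAGE,
    "--overrides", overrides,
]

# Print a shell-quoted command line that can be pasted into a terminal.
print(shlex.join(command))
```

The script only prints the command; nothing is sent to the cluster until you run the printed line yourself.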
Alternatively:
You can refer to this example if you wish to use a configMap and volumes to make a local file available inside the pod. In this example, I have created an info.log file locally on the server where I run kubectl commands.
// Create a test file in my workstation
echo "This file is written in my workstation, not inside the pod" > info.log
// create a config-map of the file:
kubectl create cm test-cm --from-file info.log
configmap/test-cm created
// mount the configmap as volume, notice the volumes and volumeMounts section:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: test-pod
  name: test-pod
spec:
  nodeName: k8s-master
  containers:
  - command:
    - sleep
    - infinity
    image: ubuntu
    name: test-pod
    resources: {}
    volumeMounts:
    - name: my-vol
      mountPath: /tmp
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:
  - name: my-vol
    configMap:
      name: test-cm
status: {}
// Test now, using the volume, I can access the info.log file from within the pod.
kubectl exec -it test-pod -- bash
root@test-pod:/# cd /tmp/
root@test-pod:/tmp# ls
info.log
root@test-pod:/tmp# cat info.log
This file is written in my workstation, not inside the pod
root@test-pod:/tmp#
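The same configMap approach can be pointed at the script from the question. The following is a rough Python sketch, not a definitive setup: it builds a pod manifest that mounts a configMap holding test.py and runs `spark-submit` against the mounted path, then prints the manifest as JSON (which kubectl accepts as well as YAML). The configMap name `spark-app-cm`, the volume name `app-vol`, and the mount path `/opt/spark-app` are assumptions chosen for illustration. A dedicated directory is used instead of /tmp so the mount does not hide the existing contents of /tmp.

```python
import json


def build_driver_pod(name, namespace, image, configmap, mount_path, script, master):
    """Build a pod manifest that runs spark-submit on a script from a configMap."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # args (not command) keeps the image entrypoint, as
                    # `kubectl run ... -- spark-submit ...` does
                    "args": [
                        "spark-submit",
                        "--master", master,
                        f"{mount_path}/{script}",
                    ],
                    "volumeMounts": [
                        {"name": "app-vol", "mountPath": mount_path}
                    ],
                }
            ],
            "volumes": [
                # hypothetical configMap created from test.py beforehand
                {"name": "app-vol", "configMap": {"name": configmap}}
            ],
        },
    }


manifest = build_driver_pod(
    name="apache-spark-client",
    namespace="apache-spark",
    image="docker.io/bitnami/spark:3.1.2-debian-10-r44",
    configmap="spark-app-cm",        # assumption: your configMap name
    mount_path="/opt/spark-app",     # assumption: any otherwise unused directory
    script="test.py",
    master="spark://10.120.112.210:30077",
)

print(json.dumps(manifest, indent=2))
```

Assuming the script is saved under a hypothetical name such as make_pod.py, you would first create the configMap with `kubectl create cm spark-app-cm --namespace apache-spark --from-file test.py`, then pipe the output into `kubectl apply -f -`.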