Run Python mrjob on a Hadoop Cluster in Kubernetes

I'm exploring the Python package mrjob for running MapReduce jobs in Python. I've tried running it in a local environment and it works perfectly.

I have Hadoop 3.3 running on a Kubernetes (GKE) cluster, and I also managed to run mrjob successfully from inside the name-node pod.

Now, I've got a Jupyter Notebook pod running in the same Kubernetes cluster (same namespace). I wonder whether I can run mrjob MapReduce jobs from the Jupyter Notebook.

The problem seems to be that I don't have $HADOOP_HOME defined in the Jupyter Notebook environment. So, based on the documentation, I created a config file called mrjob.conf as follows:

runners:
 hadoop:
  cmdenv:
    PATH: <pod name>:/opt/hadoop

However, mrjob is still unable to find the hadoop binary and gives the error below:

FileNotFoundError: [Errno 2] No such file or directory: 'hadoop'

So is there a way in which I can configure mrjob to run with my existing Hadoop installation on the GKE cluster? I've tried searching for similar examples but was unable to find one.

CodePudding user response:

mrjob is a wrapper around hadoop-streaming, so it requires the Hadoop binaries to be installed on the server(s) where the code will run (pods here, I guess), including the Jupyter pod that submits the application.
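To confirm that this is what is happening, a quick check from a notebook cell can show whether any `hadoop` executable is visible to the Jupyter pod at all. The sketch below is a simplified check and not mrjob's actual lookup logic; the helper name `find_hadoop_binary` is made up for illustration, and it only looks at `PATH` and `$HADOOP_HOME/bin`.

```python
import os
import shutil


def find_hadoop_binary(env=None):
    """Return the path of a `hadoop` executable visible in `env`, or None.

    Hypothetical helper: checks PATH first, then $HADOOP_HOME/bin/hadoop.
    """
    env = os.environ if env is None else env

    # 1. Look for an executable named `hadoop` on PATH.
    search_path = env.get("PATH")
    if search_path:
        found = shutil.which("hadoop", path=search_path)
        if found:
            return found

    # 2. Fall back to $HADOOP_HOME/bin/hadoop if that variable is set.
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home:
        candidate = os.path.join(hadoop_home, "bin", "hadoop")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate

    return None


if __name__ == "__main__":
    location = find_hadoop_binary()
    if location is None:
        print("No hadoop binary in this pod; mrjob's hadoop runner cannot submit jobs from here.")
    else:
        print("Found hadoop at", location)
```

If this finds nothing, no mrjob.conf setting can help, because the file simply does not exist in the notebook pod's filesystem. As far as I know, `cmdenv` only sets environment variables for the tasks running inside Hadoop streaming, not for the process that submits the job, and a `<pod name>:/path` value is not something `PATH` can resolve. Once Hadoop and the cluster's client configuration are present in the Jupyter image, the hadoop runner's `hadoop_bin` option can point at the binary explicitly.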

IMO, it would be much easier for you to deploy PySpark/PyFlink/Beam applications in k8s than hadoop-streaming since you don't "need" Hadoop in k8s to run such distributed processes.

I would recommend Beam, since it is compatible with Google Cloud Dataflow.