I am able to submit and run a PySpark script I wrote.
Command to run the script pyspark_client.py:
❯ clear; /opt/spark/bin/pyspark < pyspark_client.py
I am also able to run a kafka-python script separately without any issues, so the module is installed.
But the issue arises when I combine the two scripts and submit the result to PySpark.
from pyspark import SparkContext
import pyspark
from pyspark.sql.session import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.sql.functions import *
from json import dumps
from kafka import KafkaProducer
It cannot find the kafka-python module. The error is:
Traceback (most recent call last):
File "<stdin>", line 8, in <module>
ModuleNotFoundError: No module named 'kafka'
All I want to do is write the count of tweets in each window to a Kafka topic. I get the topic and count by iterating through the rows of the window DataFrame:
topic, count = row["hashtag"]["text"], row["count"]
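To make the payload concrete, here is a minimal, self-contained sketch of turning one such row into a JSON message body. The dict stands in for a pyspark Row (both support bracket indexing), and the field names are taken from the snippet above:

```python
from json import dumps

def row_to_payload(row):
    # row["hashtag"]["text"] and row["count"] work the same on a pyspark Row
    # as on this stand-in dict, since both support bracket indexing
    topic = row["hashtag"]["text"]
    count = row["count"]
    return topic, dumps({"topic": topic, "count": count}).encode("utf-8")

# stand-in for one row of the windowed DataFrame
row = {"hashtag": {"text": "#spark"}, "count": 3}
topic, payload = row_to_payload(row)
# payload is a JSON bytes value, ready to hand to a Kafka producer
```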
Minimum reproducible code for the error:
from pyspark import SparkContext
import pyspark
from pyspark.sql.session import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.sql.functions import *
from json import dumps
from kafka import KafkaProducer
print("Hello")
Version details:
- kafka-python==2.0.2
- pyspark==3.2.1
CodePudding user response:
PySpark has its own Kafka support that doesn't depend on the kafka-python
module. You should remove that dependency, along with the import. You should also run the code with spark-submit pyspark_client.py
rather than redirecting input into the pyspark interpreter.
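With Spark's built-in Kafka integration you can write the windowed counts straight from the DataFrame. A minimal sketch, assuming the column names from the question, a local broker at localhost:9092, and a topic named tweet_counts (all assumptions); the spark-sql-kafka package coordinates match Spark 3.2.1 with its default Scala 2.12:

```python
# submit with the Kafka package on the classpath, e.g.:
#   /opt/spark/bin/spark-submit \
#     --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1 \
#     pyspark_client.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, struct, to_json

spark = SparkSession.builder.appName("tweet-counts").getOrCreate()

# windowed_df stands for the aggregated DataFrame from the question,
# with a hashtag struct column (containing text) and a count column
(windowed_df
    .select(
        col("hashtag.text").alias("key"),            # Kafka message key
        to_json(struct("hashtag", "count")).alias("value"))  # Kafka message value
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("topic", "tweet_counts")                      # assumed topic name
    .save())
```

The Kafka sink only requires key and value columns (string or binary), so the select above does all the mapping; no per-row iteration or external producer is needed.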
Alternatively, assuming you're using tweepy, you don't need Spark at all to produce data to Kafka.
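In that case a plain kafka-python producer is enough. A sketch, assuming a broker at localhost:9092 and a topic named tweet_counts (both assumptions), with the send call placed wherever your tweepy stream handler fires:

```python
from json import dumps
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",               # assumed broker address
    value_serializer=lambda v: dumps(v).encode("utf-8"),
)

# call this from your tweepy stream callback (handler name is up to you)
producer.send("tweet_counts", {"topic": "#spark", "count": 1})
producer.flush()  # block until the message is actually delivered
```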