I am able to submit and run a PySpark script I wrote.
Command to run the script pyspark_client.py:
❯ clear; /opt/spark/bin/pyspark < pyspark_client.py
I am also able to run a kafka-python script separately without any issues, so the module is installed.
But the issue arises when I combine the two scripts and submit the result to PySpark.
from pyspark import SparkContext
import pyspark
from pyspark.sql.session import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.sql.functions import *
from json import dumps
from kafka import KafkaProducer
It cannot find the kafka-python module. The error is:
Traceback (most recent call last):
File "<stdin>", line 8, in <module>
ModuleNotFoundError: No module named 'kafka'
All I want to do is write the count of tweets in each window to a Kafka topic. I get the topic and count by iterating through the rows of the window DataFrame:
topic, count = row["hashtag"]["text"], row["count"]
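To make the payload concrete, here is a minimal, self-contained sketch of turning one such row into a JSON message body. The dict stands in for a pyspark Row (both support bracket indexing), and the field names are taken from the snippet above:

```python
from json import dumps

def row_to_payload(row):
    # row["hashtag"]["text"] and row["count"] work the same on a pyspark Row
    # as on this stand-in dict, since both support bracket indexing
    topic = row["hashtag"]["text"]
    count = row["count"]
    return topic, dumps({"topic": topic, "count": count}).encode("utf-8")

# stand-in for one row of the windowed DataFrame
row = {"hashtag": {"text": "#spark"}, "count": 3}
topic, payload = row_to_payload(row)
# payload is a JSON bytes value, ready to hand to a Kafka producer
```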
Minimum reproducible code for the error:
from pyspark import SparkContext
import pyspark
from pyspark.sql.session import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.sql.functions import *
from json import dumps
from kafka import KafkaProducer
print("Hello")
Version details:
- kafka-python==2.0.2
- pyspark==3.2.1
CodePudding user response:
PySpark has its own Kafka support that doesn't depend on the kafka-python
module. You should remove that dependency, along with the import. You should also run the code with spark-submit pyspark_client.py
rather than redirecting input into the pyspark interpreter.
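With Spark's built-in Kafka integration you can write the windowed counts straight from the DataFrame. A minimal sketch, assuming the column names from the question, a local broker at localhost:9092, and a topic named tweet_counts (all assumptions); the spark-sql-kafka package coordinates match Spark 3.2.1 with its default Scala 2.12:

```python
# submit with the Kafka package on the classpath, e.g.:
#   /opt/spark/bin/spark-submit \
#     --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1 \
#     pyspark_client.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, struct, to_json

spark = SparkSession.builder.appName("tweet-counts").getOrCreate()

# windowed_df stands for the aggregated DataFrame from the question,
# with a hashtag struct column (containing text) and a count column
(windowed_df
    .select(
        col("hashtag.text").alias("key"),            # Kafka message key
        to_json(struct("hashtag", "count")).alias("value"))  # Kafka message value
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("topic", "tweet_counts")                      # assumed topic name
    .save())
```

The Kafka sink only requires key and value columns (string or binary), so the select above does all the mapping; no per-row iteration or external producer is needed.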
Alternatively, assuming you're using tweepy, you don't need Spark at all to produce data to Kafka.
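In that case a plain kafka-python producer is enough. A sketch, assuming a broker at localhost:9092 and a topic named tweet_counts (both assumptions), with the send call placed wherever your tweepy stream handler fires:

```python
from json import dumps
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",               # assumed broker address
    value_serializer=lambda v: dumps(v).encode("utf-8"),
)

# call this from your tweepy stream callback (handler name is up to you)
producer.send("tweet_counts", {"topic": "#spark", "count": 1})
producer.flush()  # block until the message is actually delivered
```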