I am using PySpark for the first time and I am going nuts. It seems like the None values are not being filtered out of my DataFrame despite the filter.
from nltk.stem.porter import PorterStemmer
from pyspark.sql.functions import col
stemmer = PorterStemmer()
articles = articles.rdd \
    .map(lambda r: (" ".join(stemmer.stem(r["content"])),)) \
    .toDF(["content"]) \
    .where(col("content").isNotNull())
print(articles.count())
This is the error I get:
org.apache.spark.SparkException:
Job aborted due to stage failure:
Task 2 in stage 359.0 failed 1 times, most recent failure: Lost task 2.0 in stage 359.0 org.apache.spark.api.python.PythonException:
'AttributeError: 'NoneType' object has no attribute 'lower'', from
, line 11.
Full traceback below: ...
I suspect that the map or the lambda function is the culprit somehow, because if I change it to
lambda r: ('stemmer.stem(r["content"])',)
then it suddenly works.
What do you think is causing my issue? Or should I try some other approach to map a column?
CodePudding user response:
The exception is most likely coming from PorterStemmer.stem()
(https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py#L658),
which calls .lower() on its argument and therefore raises that AttributeError whenever content is None.
Your .where() filter only runs on the DataFrame produced after the map, so the None values still reach stem().
Filter out the rows where r["content"] is None before applying the map.
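A minimal sketch of that suggestion, keeping the content column name and the rest of the pipeline from the question. I have also split the text into words before stemming, on the assumption that content holds a whole string, since stem() operates on a single token; adjust that part if your data is already tokenized.

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Drop rows whose content is None *before* the map, so stem() never sees None
articles = articles.rdd \
    .filter(lambda r: r["content"] is not None) \
    .map(lambda r: (" ".join(stemmer.stem(w) for w in r["content"].split()),)) \
    .toDF(["content"])

print(articles.count())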