I am using PySpark for the first time and I am going nuts. It seems like the None values are not being filtered out of my DataFrame despite the filter.
from nltk.stem.porter import PorterStemmer
from pyspark.sql.functions import col
stemmer = PorterStemmer()
articles = articles.rdd \
    .map(lambda r: (" ".join(stemmer.stem(r["content"])),)) \
    .toDF(["content"]) \
    .where(col("content").isNotNull())
print(articles.count())
This is the error I get:
org.apache.spark.SparkException:
Job aborted due to stage failure:
Task 2 in stage 359.0 failed 1 times, most recent failure: Lost task 2.0 in stage 359.0 org.apache.spark.api.python.PythonException:
'AttributeError: 'NoneType' object has no attribute 'lower'', from
, line 11.
Full traceback below: ...
I suspect that the map or the lambda function is the culprit somehow, because if I change it to
lambda r: ('stemmer.stem(r["content"])',)
then it suddenly works.
What do you think is causing my issue? Or should I try some other approach to map a column?
CodePudding user response:
The exception is most likely coming from PorterStemmer.stem()
(https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py#L658),
which calls .lower() on its argument and therefore raises that AttributeError whenever content is None.
Your .where() filter only runs on the DataFrame produced after the map, so the None values still reach stem().
Filter out the rows where r["content"] is None before applying the map.
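A minimal sketch of that suggestion, keeping the content column name and the rest of the pipeline from the question. I have also split the text into words before stemming, on the assumption that content holds a whole string, since stem() operates on a single token; adjust that part if your data is already tokenized.

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Drop rows whose content is None *before* the map, so stem() never sees None
articles = articles.rdd \
    .filter(lambda r: r["content"] is not None) \
    .map(lambda r: (" ".join(stemmer.stem(w) for w in r["content"].split()),)) \
    .toDF(["content"])

print(articles.count())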