Hope someone can help with a simple sentiment analysis task in PySpark. I have a PySpark dataframe where each row contains a word. I also have a set of common stopwords.
I want to remove the rows where the word (the value of the row) is in the stopwords set.
Input:
+-------+
|   word|
+-------+
|    the|
|   food|
|     is|
|amazing|
|    and|
|  great|
+-------+
stopwords = {'the', 'is', 'and'}
Expected Output:
+-------+
|   word|
+-------+
|   food|
|amazing|
|  great|
+-------+
CodePudding user response:
Use a negated isin:
from pyspark.sql import functions as F

df = df.filter(~F.col("word").isin(stop_words))
where stop_words is:
stop_words = {"the", "is", "and"}
Result:
+-------+
|word   |
+-------+
|food   |
|amazing|
|great  |
+-------+
CodePudding user response:
You can create a dataframe from the set of stopwords, then join it with the input dataframe using a left_anti join:
stopwords_df = spark.createDataFrame([[w] for w in stopwords], ["word"])
result_df = input_df.join(stopwords_df, ["word"], "left_anti")
result_df.show()
# +-------+
# |   word|
# +-------+
# |amazing|
# |   food|
# |  great|
# +-------+