Python PySpark - Text Analysis / Removing rows if word (value of row) is in a dictionary of stopwords

I hope someone can help with a simple sentiment analysis in PySpark. I have a PySpark DataFrame where each row contains a single word. I also have a set of common stopwords.

I want to remove the rows where the word (the value of the row) is in the stopwords set.

Input:

+-------+
|   word|
+-------+
|    the|
|   food|
|     is|
|amazing|
|    and|
|  great|
+-------+

stopwords = {'the', 'is', 'and'}

Expected Output:

+-------+
|   word|
+-------+
|   food|
|amazing|
|  great|
+-------+

CodePudding user response：

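Use a negated isin (the ~ operator inverts the boolean column that isin returns):

from pyspark.sql import functions as F
df = df.filter(~F.col("word").isin(stop_words))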

where stop_words is:

stop_words = {"the", "is", "and"}

Result:

+-------+
|word   |
+-------+
|food   |
|amazing|
|great  |
+-------+

CodePudding user response：

You can create a DataFrame from the set of stopwords, then join it with the input DataFrame using a left_anti join, which keeps only the rows of input_df that have no match in stopwords_df:

stopwords_df = spark.createDataFrame([[w] for w in stopwords], ["word"])

result_df = input_df.join(stopwords_df, ["word"], "left_anti")

result_df.show()
#+-------+
#|   word|
#+-------+
#|amazing|
#|   food|
#|  great|
#+-------+