I'm using Spark SQL and have a DataFrame with user IDs and product reviews. I need to filter stop words out of the reviews, and I have a text file containing the stop words to filter.
I managed to split the reviews into lists of strings, but I don't know how to do the filtering.
This is what I tried:
from pyspark.sql.functions import col, split
stopWords = spark.read.text('/FileStore/tables/english.txt')
df.select(split(col("reviewText")," "))
df.filter(col("reviewText") == stopWords)
thanks!
CodePudding user response:
You are a little vague in that you do not mention the flatMap approach, which is more common; a rough sketch of that idea follows below.
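A minimal sketch of that flatMap idea, assuming the asker's df with its reviewText column and a collected stopWords list (the names come from the question; this is a sketch, not a definitive implementation):
# Hypothetical sketch: tokenize every review with flatMap and drop the stop words.
# Assumes df has a "reviewText" string column and stopWords is a Python list.
stop_set = set(stopWords)
review_tokens = (df.select("reviewText").rdd
                 .flatMap(lambda row: row.reviewText.lower().split(" "))
                 .filter(lambda token: token and token not in stop_set))
print(review_tokens.take(20))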
Here is an alternative that just works on the DataFrame column.
import pyspark.sql.functions as F
# Read the stop-word file and collect the words into a Python list.
stopWordsIn = spark.read.text('/FileStore/tables/sw.txt').rdd.flatMap(lambda line: line.value.split(" "))
stopWords = stopWordsIn.collect()
print(stopWords)
# Read the text to clean, lower-case it and strip non-alphanumeric characters,
# then remove whole-word matches of any stop word.
words = spark.read.text('/FileStore/tables/df.txt')
words = words.withColumn('value_1', F.lower(F.regexp_replace('value', "[^0-9a-zA-Z^ ]+", "")))
words = words.withColumn('value_2', F.regexp_replace('value_1', '\\b(' + '|'.join(stopWords) + ')\\b', ''))
words.show()
This returns the output below; from there you can drop the intermediate columns you do not want.
['a', 'in', 'the']
+--------------+-------------+-------------+
|         value|      value_1|      value_2|
+--------------+-------------+-------------+
|      A quick2|     A quick2|       quick2|
|brown fox was#|brown fox was|brown fox was|
| in the house.| in the house|        house|
+--------------+-------------+-------------+
You can see that the stop words were removed, and that I converted everything to lower case and stripped some punctuation out.
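If you also want to tidy the result, something like the following sketch (using the words DataFrame and column names from above; the exact cleanup is up to you) collapses the double spaces left behind by the removed stop words and keeps only the cleaned column:
# Collapse the extra whitespace left where stop words were removed,
# trim the ends, and keep only the cleaned column.
cleaned = (words
           .withColumn('value_2', F.trim(F.regexp_replace('value_2', '\\s+', ' ')))
           .select('value_2'))
cleaned.show()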