I'm using Spark SQL and have a DataFrame with user IDs and product reviews. I need to filter stop words out of the reviews, and I have a text file containing the stop words to filter.
I managed to split the reviews into lists of strings, but I don't know how to do the filtering.
This is what I tried:
from pyspark.sql.functions import col, split
stopWords = spark.read.text('/FileStore/tables/english.txt')
df.select(split(col("reviewText")," "))
df.filter(col("reviewText") == stopWords)
thanks!
CodePudding user response:
You are a little vague in that you do not mention the flatMap approach, which is more common; a rough sketch of that idea follows below.
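A minimal sketch of that flatMap idea, assuming the asker's df with its reviewText column and a collected stopWords list (the names come from the question; this is a sketch, not a definitive implementation):
# Hypothetical sketch: tokenize every review with flatMap and drop the stop words.
# Assumes df has a "reviewText" string column and stopWords is a Python list.
stop_set = set(stopWords)
review_tokens = (df.select("reviewText").rdd
                 .flatMap(lambda row: row.reviewText.lower().split(" "))
                 .filter(lambda token: token and token not in stop_set))
print(review_tokens.take(20))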
Here is an alternative that just works on the DataFrame column.
import pyspark.sql.functions as F
# Read the stop-word file and collect the words into a Python list.
stopWordsIn = spark.read.text('/FileStore/tables/sw.txt').rdd.flatMap(lambda line: line.value.split(" "))
stopWords = stopWordsIn.collect()
print(stopWords)
# Read the text to clean, lower-case it and strip non-alphanumeric characters,
# then remove whole-word matches of any stop word.
words = spark.read.text('/FileStore/tables/df.txt')
words = words.withColumn('value_1', F.lower(F.regexp_replace('value', "[^0-9a-zA-Z^ ]+", "")))
words = words.withColumn('value_2', F.regexp_replace('value_1', '\\b(' + '|'.join(stopWords) + ')\\b', ''))
words.show()
This returns the output below; from there you can drop the intermediate columns you do not want.
['a', 'in', 'the']
+--------------+-------------+-------------+
|         value|      value_1|      value_2|
+--------------+-------------+-------------+
|      A quick2|     A quick2|       quick2|
|brown fox was#|brown fox was|brown fox was|
| in the house.| in the house|        house|
+--------------+-------------+-------------+
You can see that the stop words were removed, and that I converted everything to lower case and stripped some punctuation out.
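If you also want to tidy the result, something like the following sketch (using the words DataFrame and column names from above; the exact cleanup is up to you) collapses the double spaces left behind by the removed stop words and keeps only the cleaned column:
# Collapse the extra whitespace left where stop words were removed,
# trim the ends, and keep only the cleaned column.
cleaned = (words
           .withColumn('value_2', F.trim(F.regexp_replace('value_2', '\\s+', ' ')))
           .select('value_2'))
cleaned.show()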