I was trying to clean the words in a list using the following code:
import re

from pyspark.sql import functions as sf
from pyspark.sql.types import ArrayType, StringType

# define function to clean list of words
def clear_list(words_list):
    regex = re.compile(r'[\w\d]{2,}', re.U)
    filtered = [i for i in words_list if regex.match(i)]
    return filtered

clear_list_udf = sf.udf(clear_list, ArrayType(StringType()))
items = items.withColumn("clear_words", clear_list_udf(sf.col("words")))
I need only words longer than one letter, with the punctuation stripped. But I run into a problem in cases like the following:
What I have:
["""непутевые, заметки"", с, дмитрием, крыловым"] -->
[заметки"", дмитрием, крыловым"]
What I need:
["""непутевые, заметки"", с, дмитрием, крыловым"] -->
[непутевые, заметки, дмитрием, крыловым]
CodePudding user response:
You can use regexp_replace and then filter on the DataFrame to achieve the result in PySpark itself.
We should avoid UDFs as much as possible, because a UDF is a black box to Spark: the optimizer cannot apply its optimizations to it efficiently.
from pyspark.sql.functions import regexp_replace, col, length

# strip unwanted characters, keeping the same column name for the filter below
df = df.select(regexp_replace(col("col_name"), "[^a-zA-Z0-9]", "").alias("col_name"))
# keep only values at least two characters long
df = df.where(length(col("col_name")) >= 2)
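Note that in the question the words column is an array of strings, and the character class [^a-zA-Z0-9] would also strip Cyrillic letters. Below is a minimal sketch of the same idea adapted to an array column, assuming Spark 2.4+ (for the SQL higher-order functions transform and filter) and the items/words names from the question; \p{L} matches any Unicode letter, so the Russian words survive:

from pyspark.sql.functions import expr

# a sketch, assuming `items` has an array<string> column "words" as in the question
items = items.withColumn(
    "clear_words",
    expr(r"""
        filter(
            transform(words, w -> regexp_replace(w, '[^\\p{L}0-9]', '')),
            w -> length(w) >= 2
        )
    """),
)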
CodePudding user response:
Replace this line:
filtered = [i for i in words_list if regex.match(i)]
With this line:
filtered = [regex.search(i).group() for i in words_list if regex.search(i)]
The regular expression given is good, but the list comprehension returns the original value, not the matched substring. Code sample:
import re

regex = re.compile(r'[\w\d]{2,}', re.U)
words_list = ['""word', 'wor"', 'c', "test"]
filtered = [regex.search(i).group() for i in words_list if regex.search(i)]
print(filtered)
> ['word', 'wor', 'test']
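Putting it together with the UDF registration from the question (a sketch, assuming the items DataFrame and the "words" column from the question):

import re

from pyspark.sql import functions as sf
from pyspark.sql.types import ArrayType, StringType

def clear_list(words_list):
    regex = re.compile(r'[\w\d]{2,}', re.U)
    # search() scans anywhere in the token; group() returns only the matched letters/digits
    return [regex.search(i).group() for i in words_list if regex.search(i)]

clear_list_udf = sf.udf(clear_list, ArrayType(StringType()))
items = items.withColumn("clear_words", clear_list_udf(sf.col("words")))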