I was trying to clean the words in a list using the following code:
import re

from pyspark.sql import functions as sf
from pyspark.sql.types import ArrayType, StringType

# define function to clean list of words
def clear_list(words_list):
    regex = re.compile(r'[\w\d]{2,}', re.U)
    filtered = [i for i in words_list if regex.match(i)]
    return filtered

clear_list_udf = sf.udf(clear_list, ArrayType(StringType()))
items = items.withColumn("clear_words", clear_list_udf(sf.col("words")))
I need only words longer than one letter, with the punctuation stripped. But I run into a problem in cases like the following:
What I have:
["""непутевые, заметки"", с, дмитрием, крыловым"] -->
[заметки"", дмитрием, крыловым"]
What I need:
["""непутевые, заметки"", с, дмитрием, крыловым"] -->
[непутевые, заметки, дмитрием, крыловым]
CodePudding user response:
You can use regexp_replace and then filter on the DataFrame to achieve the result in PySpark itself.
We should avoid UDFs as much as possible, because a UDF is a black box to Spark: the optimizer cannot apply its optimizations to it efficiently.
from pyspark.sql.functions import regexp_replace, col, length

# strip unwanted characters, keeping the same column name for the filter below
df = df.select(regexp_replace(col("col_name"), "[^a-zA-Z0-9]", "").alias("col_name"))
# keep only values at least two characters long
df = df.where(length(col("col_name")) >= 2)
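Note that in the question the words column is an array of strings, and the character class [^a-zA-Z0-9] would also strip Cyrillic letters. Below is a minimal sketch of the same idea adapted to an array column, assuming Spark 2.4+ (for the SQL higher-order functions transform and filter) and the items/words names from the question; \p{L} matches any Unicode letter, so the Russian words survive:

from pyspark.sql.functions import expr

# a sketch, assuming `items` has an array<string> column "words" as in the question
items = items.withColumn(
    "clear_words",
    expr(r"""
        filter(
            transform(words, w -> regexp_replace(w, '[^\\p{L}0-9]', '')),
            w -> length(w) >= 2
        )
    """),
)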
CodePudding user response:
Replace this line:
filtered = [i for i in words_list if regex.match(i)]
With this line:
filtered = [regex.search(i).group() for i in words_list if regex.search(i)]
The regular expression given is good, but the list comprehension returns the original value, not the matched substring. Code sample:
import re

regex = re.compile(r'[\w\d]{2,}', re.U)
words_list = ['""word', 'wor"', 'c', "test"]
filtered = [regex.search(i).group() for i in words_list if regex.search(i)]
print(filtered)
> ['word', 'wor', 'test']
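Putting it together with the UDF registration from the question (a sketch, assuming the items DataFrame and the "words" column from the question):

import re

from pyspark.sql import functions as sf
from pyspark.sql.types import ArrayType, StringType

def clear_list(words_list):
    regex = re.compile(r'[\w\d]{2,}', re.U)
    # search() scans anywhere in the token; group() returns only the matched letters/digits
    return [regex.search(i).group() for i in words_list if regex.search(i)]

clear_list_udf = sf.udf(clear_list, ArrayType(StringType()))
items = items.withColumn("clear_words", clear_list_udf(sf.col("words")))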