How to extract exactly the same word with regexp_extract_all in pyspark

Time:12-04

I am having some issues finding the correct regular expression.

Let's say I have this list of keywords:

keywords = [' b.o.o', ' a.b.a', ' titi']

(Please keep in mind that there is a blank space before each keyword, and this list can contain up to 100 keywords, so I can't do it without a function.) And my DataFrame df:

[image: the input DataFrame df]

I use the following code to extract the matching words. It works only partially, because it also extracts words that are not an exact match:

keywords = [' b.o.o', ' a.b.a', ' titi']

pattern = '(' + '|'.join([fr'\\b({k})\\b' for k in keywords]) + ')'

df.withColumn('words', F.expr(f"regexp_extract_all(colB, '{pattern}', 1)"))

The output:

[image: actual output]

But here is the expected output:

[image: expected output]

As we can see, the code extracts words that are not exact matches because it does not treat the dot as a literal character. For example, it considers awbwa a match for a.b.a, since each dot matches the w. I also tried

pattern = '(' + '|'.join([fr'\\b({k})\\b' for k in [re.escape(x) for x in keywords]]) + ')'

to add a backslash before every dot and before the blank space, but it doesn't work.

Thank you so much for your help. (By the way, I looked everywhere on Stack Overflow and didn't find an answer to this.)

CodePudding user response:

I think you need to add a backslash before each dot in your regular expression pattern to escape it, so it is treated as a literal dot rather than the special character that matches any single character.

In your code, you can use the re.escape() function from the re module to escape all special characters in the keywords before joining them into the pattern. Here's an example:

import re

keywords = [' b.o.o', ' a.b.a', ' titi']

# Escape special characters in the keywords using re.escape()
escaped_keywords = [re.escape(keyword) for keyword in keywords]

# Join the escaped keywords with '|' as the separator
pattern = '(' + '|'.join(escaped_keywords) + ')'

# Use the pattern in the regexp_extract_all() call.
# Note: Spark SQL parses backslash escapes inside string literals, so the
# backslashes that re.escape() adds may need to be doubled, e.g. with
# pattern.replace('\\', '\\\\'), before embedding the pattern in F.expr().
df.withColumn('words', F.expr(f"regexp_extract_all(colB, '{pattern}', 1)"))

This should give you the expected output where only exact matches are extracted.
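As a quick sanity check outside Spark, you can reproduce the false positive and its fix with Python's re module (regexp_extract_all uses Java regex, but escaping behaves the same way for these patterns); the sample strings below are illustrative:

```python
import re

keywords = [' b.o.o', ' a.b.a', ' titi']

# Without escaping, '.' is a wildcard, so ' awbwa' wrongly matches ' a.b.a'
raw_pattern = '(' + '|'.join(keywords) + ')'
print(re.findall(raw_pattern, 'xx awbwa yy'))      # [' awbwa'] -- false positive

# With re.escape(), the dots become literal and the false positive disappears
escaped_pattern = '(' + '|'.join(re.escape(k) for k in keywords) + ')'
print(re.findall(escaped_pattern, 'xx awbwa yy'))  # []
print(re.findall(escaped_pattern, 'xx a.b.a yy'))  # [' a.b.a']
```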

CodePudding user response:

You can use the \b word boundary metacharacter to match whole words only, and escape the dots as \. so they match a literal dot.

Here is an example:

import pyspark.sql.functions as F

keywords = [' b.o.o', ' a.b.a', ' titi']

# Escape the dots and add word boundaries; the backslashes are doubled so
# that they survive Spark SQL's string-literal parsing inside F.expr()
escaped = [k.replace('.', r'\\.') for k in keywords]
pattern = '(' + '|'.join(fr'\\b({k})\\b' for k in escaped) + ')'

df.withColumn('words', F.expr(f"regexp_extract_all(colB, '{pattern}', 1)"))

This will match b.o.o, a.b.a, and titi as whole words, and will not match substrings like awbwa.
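The effect of the word boundaries can be checked the same way with Python's re module (Spark's regexp_extract_all uses Java regex, but \b and \. behave identically for these patterns); the sample strings below are illustrative:

```python
import re

keywords = [' b.o.o', ' a.b.a', ' titi']

# Escaped dots plus \b on both sides: whole-word matches only
pattern = '|'.join(r'\b' + re.escape(k) + r'\b' for k in keywords)

print(bool(re.search(pattern, 'xx b.o.o yy')))   # True: whole-word match
print(bool(re.search(pattern, 'xx b.o.oz yy')))  # False: trailing \b rejects b.o.oz
print(bool(re.search(pattern, 'xx awbwa yy')))   # False: dots are literal
```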
