Use dictionary as part of regexp_replace in PySpark

Time:08-12

I am trying to use a dictionary like this:

mydictionary = {'AL':'Alabama', '(AL)': 'Alabama', 'WI':'Wisconsin','GA': 'Georgia','(GA)': 'Georgia'}

To go through a spark dataframe:

data = [
    {"ID": 1, "TheString": "On WI ! On WI !"},
    {"ID": 2, "TheString": "The state of AL is next to GA"},
    {"ID": 3, "TheString": "The state of (AL) is also next to (GA)"},
    {"ID": 4, "TheString": "Alabama is in the South"},
    {"ID": 5, "TheString": "Wisconsin is up north way"},
]
sdf = spark.createDataFrame(data)
display(sdf)

And replace every substring that matches a dictionary key with the corresponding dictionary value.

So, something like this:

for k, v in mydictionary.items():
    replacement_expr = regexp_replace(col("TheString"), '(\s+)' + k, v)
    print(replacement_expr)
                      
sdf.withColumn("TheString_New", replacement_expr).show(truncate=False)

(This, of course, does not work; the regular expression being built is wrong.)

A few things to note: the abbreviation always has either a space before and after it, or left and right parentheses around it. I think the main problem is that I can't get the regex to build correctly across the dictionary elements (and then also apply the "space or parentheses" restriction noted above). I realize I could drop the keys with parentheses, such as (GA), and just use GA with spaces or parentheses as boundaries, but it seemed simpler to have those cases in the dictionary.

Expected result:

On Wisconsin ! On Wisconsin !
The state of Alabama is next to Georgia
The state of (Alabama) is also next to (Georgia)
Alabama is in the South
Wisconsin is up north way

Your help is much appreciated.

Some close solutions I've looked at: Replace string based on dictionary pyspark

CodePudding user response:

Use \b in the regex to match a word boundary. You can also use functools.reduce to build the replace expression from the dict items, like this:

from functools import reduce
from pyspark.sql import functions as F


replace_expr = reduce(
    lambda a, b: F.regexp_replace(a, rf"\b{b[0]}\b", b[1]),
    mydictionary.items(),
    F.col("TheString")
)

sdf.withColumn("TheString", replace_expr).show(truncate=False)

# +---+------------------------------------------------+
# |ID |TheString                                       |
# +---+------------------------------------------------+
# |1  |On Wisconsin ! On Wisconsin !                   |
# |2  |The state of Alabama is next to Georgia         |
# |3  |The state of (Alabama) is also next to (Georgia)|
# |4  |Alabama is in the South                         |
# |5  |Wisconsin is up north way                       |
# +---+------------------------------------------------+
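
The fold above can be sanity-checked without a Spark session. The following is only a rough sketch: it swaps F.regexp_replace for Python's re.sub, on the assumption that \b behaves the same way in Python's re module and in the Java regex engine Spark uses for these plain ASCII strings. The names replace_abbreviations, rows and plain_keys are made up for illustration. It also suggests that the keys with parentheses are redundant here: inside a pattern, (AL) is a capture group rather than literal parentheses, so \bAL\b has already handled that case.

```python
import re
from functools import reduce

# Same dictionary as in the question.
mydictionary = {'AL': 'Alabama', '(AL)': 'Alabama', 'WI': 'Wisconsin',
                'GA': 'Georgia', '(GA)': 'Georgia'}

def replace_abbreviations(text, mapping):
    """Apply one whole-word re.sub per (key, value) pair, left to right."""
    return reduce(
        lambda acc, kv: re.sub(rf"\b{kv[0]}\b", kv[1], acc),
        mapping.items(),
        text,
    )

# The strings from the question's dataframe.
rows = [
    "On WI ! On WI !",
    "The state of AL is next to GA",
    "The state of (AL) is also next to (GA)",
    "Alabama is in the South",
    "Wisconsin is up north way",
]

results = [replace_abbreviations(s, mydictionary) for s in rows]

# Hypothetical variant: keep only the keys without parentheses.
plain_keys = {k: v for k, v in mydictionary.items() if not k.startswith("(")}
results_plain = [replace_abbreviations(s, plain_keys) for s in rows]

# Escaping a parenthesized key makes the parentheses literal, and \b can
# no longer match between a space and "(", so nothing is replaced.
escaped = re.sub(r"\b" + re.escape("(AL)") + r"\b", "Alabama", "of (AL) is")

for line in results:
    print(line)
```

If the dictionary keys could ever contain other regex metacharacters, they would need escaping, and the boundary logic would then have to be rethought, since \b only matches between a word character and a non-word character.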