How to use regexp_replace with contains?-CodePudding

I use this function to find if the pattern is in the column and replace it with the replacement but it does not give.

Can someone tell where I make the mistakes?

patterns = [
    '15/19',
    '14/11',
    'HTP',
    'VTP'
]

replacements = [
    'S15/19',
    'S11/14',
    'HTP',
    'VTP'
]
def formate_column(output_column, df, patterns, replacements):
    for p, r in zip(patterns, replacements):
        df = (
            df.withColumn(output_column, F.when(F.col("column").contains(p), F.regexp_replace(F.col("column"), p, r)).otherwise(F.col("column")))
            
    )
    return df

CodePudding user response：

You could move your both lists into one dictionary. Then, looping will be easy and efficient. You don't need when/otherwise. It's enough to use regexp_replace, because if a match is not found, nothing is being replaced in the string.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame([("xx_15/19_",), ("_14/11111",), ("no",)], ["column"])
df.show()
#  --------- 
# |   column|
#  --------- 
# |xx_15/19_|
# |_14/11111|
# |       no|
#  ---------

Script:

patterns = {
    r'15/19'  :'S15/19',
    r'14/11'  :'S11/14',
    r'HTP'    :'HTP',
    r'VTP'    :'VTP'
}

for k, v in patterns.items():
    df = df.withColumn("column", F.regexp_replace("column", k, v))
df.show()
#  ---------- 
# |    column|
#  ---------- 
# |xx_S15/19_|
# |_S11/14111|
# |        no|
#  ---------- 

df.explain()
# == Physical Plan ==
# *(1) Project [regexp_replace(regexp_replace(regexp_replace(regexp_replace(column#619, 15/19, S15/19, 1), 14/11, S11/14, 1), HTP, HTP, 1), VTP, VTP, 1) AS column#632]
#  - *(1) Scan ExistingRDD[column#619]