I use this function to find if the pattern is in the column and replace it with the replacement but it does not give.
Can someone tell where I make the mistakes?
patterns = [
'15/19',
'14/11',
'HTP',
'VTP'
]
replacements = [
'S15/19',
'S11/14',
'HTP',
'VTP'
]
def formate_column(output_column, df, patterns, replacements):
for p, r in zip(patterns, replacements):
df = (
df.withColumn(output_column, F.when(F.col("column").contains(p), F.regexp_replace(F.col("column"), p, r)).otherwise(F.col("column")))
)
return df
CodePudding user response:
You could move your both lists into one dictionary. Then, looping will be easy and efficient. You don't need when
/otherwise
. It's enough to use regexp_replace
, because if a match is not found, nothing is being replaced in the string.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame([("xx_15/19_",), ("_14/11111",), ("no",)], ["column"])
df.show()
# ---------
# | column|
# ---------
# |xx_15/19_|
# |_14/11111|
# | no|
# ---------
Script:
patterns = {
r'15/19' :'S15/19',
r'14/11' :'S11/14',
r'HTP' :'HTP',
r'VTP' :'VTP'
}
for k, v in patterns.items():
df = df.withColumn("column", F.regexp_replace("column", k, v))
df.show()
# ----------
# | column|
# ----------
# |xx_S15/19_|
# |_S11/14111|
# | no|
# ----------
df.explain()
# == Physical Plan ==
# *(1) Project [regexp_replace(regexp_replace(regexp_replace(regexp_replace(column#619, 15/19, S15/19, 1), 14/11, S11/14, 1), HTP, HTP, 1), VTP, VTP, 1) AS column#632]
# - *(1) Scan ExistingRDD[column#619]