Home > Enterprise >  How to search through strings in Pyspark column and selectively replace some strings (containing spe
How to search through strings in Pyspark column and selectively replace some strings (containing spe

Time:11-12

I have a column in a Spark Dataframe that contains a list of strings. I am hoping to do the following and am not sure how:

  1. Search the column for the presence of a substring, if this substring is present, replace that string with a word.
  2. If one of the desired substrings is not present, then replace the string with 'other'

Sample SDF:

data = [
    [1, ["EQUCAB", "EQUCAS", "CANMA"]],
    [2, ["CANMA", "FELCAT", "SUSDO"]],
    [3, ["BOSTAU", "EQUCAB"]],
]

df = pd.DataFrame(data, columns=["Item", "String"])
df["String"] = [",".join(map(str, l)) for l in df["String"]]

sdf = spark.createDataFrame(df)

Desired output:

data = [
    [1, ["horse", "horse", "other"]],
    [2, ["other", "cat", "other"]],
    [3, ["cow", "horse"]],
]

df = pd.DataFrame(data, columns=["Item", "String"])
df["String"] = [",".join(map(str, l)) for l in df["String"]]

so basically, any element that contains EQU is assigned horse, any element that contains FEL is assigned cat, any element that contains BOS is assigned cow, and the rest are assigned other.

CodePudding user response:

from pyspark.sql.functions import when, col, lit

df = df.withColum("String",
                    when(col('String').contains('EQU'), lit('horse'))
                    .when(col('String').contains('FEL'), lit('cat'))
                    .when(col('String').contains('BOS'), lit('cow')).otherwise(lit('other')))
  • Related