I have a column in a Spark Dataframe that contains a list of strings. I am hoping to do the following and am not sure how:
- Search the column for the presence of a substring, if this substring is present, replace that string with a word.
- If one of the desired substrings is not present, then replace the string with 'other'
Sample SDF:
data = [
[1, ["EQUCAB", "EQUCAS", "CANMA"]],
[2, ["CANMA", "FELCAT", "SUSDO"]],
[3, ["BOSTAU", "EQUCAB"]],
]
df = pd.DataFrame(data, columns=["Item", "String"])
df["String"] = [",".join(map(str, l)) for l in df["String"]]
sdf = spark.createDataFrame(df)
Desired output:
data = [
[1, ["horse", "horse", "other"]],
[2, ["other", "cat", "other"]],
[3, ["cow", "horse"]],
]
df = pd.DataFrame(data, columns=["Item", "String"])
df["String"] = [",".join(map(str, l)) for l in df["String"]]
so basically, any element that contains EQU
is assigned horse
, any element that contains FEL
is assigned cat
, any element that contains BOS
is assigned cow
, and the rest are assigned other.
CodePudding user response:
from pyspark.sql.functions import when, col, lit
df = df.withColum("String",
when(col('String').contains('EQU'), lit('horse'))
.when(col('String').contains('FEL'), lit('cat'))
.when(col('String').contains('BOS'), lit('cow')).otherwise(lit('other')))