I have a dataframe:
"people" "other"
father ...
mother
cat
brother
dog
I want to add a new column: if the word in the people column appears in a specific array, replace it with the corresponding word from another array; otherwise the word should stay the same:
array=['father','mother','brother']
array_new=['dad','mum','bro']
"people" "other" "new"
father ... dad
mother mum
cat cat
brother bro
dog dog
I was thinking of using this:
expression = (
    "CASE "
    + "".join(["WHEN people LIKE '{}' THEN '{}' ".format(val, array_new[array.index(val)]) for val in array])
    + "ELSE 'None' END"
)
df_pyspark = df_pyspark.withColumn("new", functions.expr(expression))
I should change the ELSE condition so that it copies the original word, but I don't know how to do it.
CodePudding user response:
You can use a literal map expression, built from array and array_new with the create_map function:
from pyspark.sql import functions as F
from itertools import chain

# Literal map column: {'father': 'dad', 'mother': 'mum', 'brother': 'bro'}
mapping = F.create_map(*[F.lit(x) for x in chain(*zip(array, array_new))])

# Look up the people value in the map; fall back to the original value when there is no match
df1 = df.withColumn("new", F.coalesce(mapping[F.col("people")], F.col("people")))
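For reference, here is a minimal end-to-end sketch of this first approach, assuming a small dataframe built from the sample rows in the question (the column names people and other come from the question; the other values are just placeholders):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from itertools import chain

spark = SparkSession.builder.getOrCreate()

# placeholder data mirroring the question's example
df = spark.createDataFrame(
    [("father", "..."), ("mother", "..."), ("cat", "..."), ("brother", "..."), ("dog", "...")],
    ["people", "other"],
)

array = ['father', 'mother', 'brother']
array_new = ['dad', 'mum', 'bro']

mapping = F.create_map(*[F.lit(x) for x in chain(*zip(array, array_new))])
df1 = df.withColumn("new", F.coalesce(mapping[F.col("people")], F.col("people")))
df1.show()
# father -> dad, mother -> mum, brother -> bro; cat and dog keep their original word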
Or use na.replace, passing a dict that you create from the two arrays like this:
from pyspark.sql import functions as F

# Replacement dict: {'father': 'dad', 'mother': 'mum', 'brother': 'bro'}
mapping_dict = dict(zip(array, array_new))

# Copy people into new, then replace only the values present in the dict
df1 = df.withColumn("new", F.col("people")).na.replace(mapping_dict, subset=["new"])
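Note that na.replace only substitutes values that appear as keys in mapping_dict, so rows like cat and dog keep their original word; that is why no coalesce is needed with this approach.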
Another way is to chain multiple when expressions:
from functools import reduce
from pyspark.sql import functions as F

# Fold the (old, new) pairs into one chained when(...) expression
when_expr = reduce(
    lambda acc, x: acc.when(F.col("people") == x[0], x[1]),
    zip(array, array_new),
    F,
).otherwise(F.col("people"))  # keep the original value when nothing matched

df1 = df.withColumn("new", when_expr)
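If you prefer to stay close to the SQL expression from the question, a sketch of that fix (assuming the same array, array_new and column names as above) is simply to make the ELSE branch return the column itself instead of the literal 'None':

from pyspark.sql import functions as F

# Referencing the column name in the ELSE branch keeps the original word
expression = (
    "CASE "
    + "".join("WHEN people LIKE '{}' THEN '{}' ".format(old, new) for old, new in zip(array, array_new))
    + "ELSE people END"
)
df1 = df.withColumn("new", F.expr(expression))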