Home > Back-end >  How to replace a particular value in a Pyspark Dataframe column with another value?
How to replace a particular value in a Pyspark Dataframe column with another value?

Time:11-30

I have the following Pyspark Dataframe called 'df':

A = ["OTH/CON", "Freight Collect", "OTH/CON", "DBG"]
B = [2, 3, 4, 5]
df = sqlContext.createDataFrame(zip(A, B), schema=['A', 'B'])

In the column 'A', I need to replace the values "OTH/CON" & "Freight Collect" with another string "Collect". And replace "DBG" by "Dispose". Then place the values into a new column 'aa'. I do the following:

from pyspark.sql import functions as F
df = df.withColumn("aa", F.when(F.col("A").isin(["OTH/CON"]), F.lit("Collect")).otherwise(F.col("A")))
df = df.withColumn("aa", F.when(F.col("A").isin(["Freight Collect"]), F.lit("Collect")).otherwise(F.col("A")))
df = df.withColumn("aa", F.when(F.col("A").isin(["DBG"]), F.lit("Dispose")).otherwise(F.col("A")))

But I end up getting only "Freight Collect" value changed to "Collect". "OTH/CON" remains as it is.

I'm not able to figure out why!

My expected output is as follows:

 --------------- --- ------- 
|              A|  B|     aa|
 --------------- --- ------- 
|        OTH/CON|  2|Collect|
|Freight Collect|  3|Collect|
|        OTH/CON|  4|Collect|
|            DBG|  5|Dispose|
 --------------- --- ------- 

Can anyone please help?

CodePudding user response:

You can merge multiple isin conditions into one

(df
    .withColumn('aa', F
        .when(F.col('A').isin(['OTH/CON', 'Freight Collect']), F.lit('Collect'))
        .when(F.col('A').isin(['DBG']), F.lit('Dispose'))
        .otherwise(F.col('A'))
    )
    .show()
)

 --------------- --- ------- 
|              A|  B|     aa|
 --------------- --- ------- 
|        OTH/CON|  2|Collect|
|Freight Collect|  3|Collect|
|        OTH/CON|  4|Collect|
|            DBG|  5|Dispose|
 --------------- --- ------- 
  • Related