I'm trying to change the column values using a when condition. The when condition works well, and the column gets updated with the corresponding literal values for "DJOUR" and "DECBA", which are both 5 characters long, but the when condition is not working for the other values like "DEP", "BUY", "WITH"..., which are shorter than 5 characters.
Why is the when condition unable to identify the shorter string values?
Also, please correct me if I'm using the when condition incorrectly.
from pyspark.sql.functions import when, col, lit

df1 = df.withColumn("selling_type", when(col("selling_type") == "BUY", lit("ABCDEF1"))
    .when(col("selling_type") == "CJOUR", lit("ABCDEF2"))
    .when(col("selling_type") == "DEP", lit("ABCDEF3"))    # not working
    .when(col("selling_type") == "DIV", lit("ABCDEF4"))    # not working
    .when(col("selling_type") == "DJOUR", lit("ABCDEF5"))  # working
    .when(col("selling_type") == "DECBA", lit("ABCDEF6"))  # working
    .when(col("selling_type") == "MFEE", lit("ABCDEF7"))
    .when(col("selling_type") == "SELL", lit("ABCDEF8"))
    .when(col("selling_type") == "TEFRA", lit("ABCDEF9"))
    .when(col("selling_type") == "WITH", lit("ABCDEF10"))
    .otherwise(col("selling_type")))
CodePudding user response:
Try the following; it will trim all the visible and invisible whitespace from both sides of your strings. Also, you can replace all the repetitive when clauses with a simpler replace method.
Input dataframe:
from pyspark.sql import functions as F
df = spark.createDataFrame([("BUY",), ("CJOUR",), ("DEP",), ("DIV",), ("DJOUR",), ("DECBA",), ("MFEE",), ("SELL",), ("TEFRA",), ("WITH",)], ["selling_type"])
Script:
df = df.withColumn("selling_type", F.regexp_replace("selling_type", r"^\s+|\s+$", ""))
df = df.replace({
    "BUY": "ABCDEF1",
    "CJOUR": "ABCDEF2",
    "DEP": "ABCDEF3",
    "DIV": "ABCDEF4",
    "DJOUR": "ABCDEF5",
    "DECBA": "ABCDEF6",
    "MFEE": "ABCDEF7",
    "SELL": "ABCDEF8",
    "TEFRA": "ABCDEF9",
    "WITH": "ABCDEF10",
}, subset=["selling_type"])
df.show()
# +------------+
# |selling_type|
# +------------+
# |     ABCDEF1|
# |     ABCDEF2|
# |     ABCDEF3|
# |     ABCDEF4|
# |     ABCDEF5|
# |     ABCDEF6|
# |     ABCDEF7|
# |     ABCDEF8|
# |     ABCDEF9|
# |    ABCDEF10|
# +------------+
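If you would rather keep the when semantics instead of replace (for instance, to add more complex conditions later), the repetitive chain can also be built from a dict. Here is a minimal sketch, assuming the same column and mapping as above:
from functools import reduce
from pyspark.sql import functions as F

mapping = {
    "BUY": "ABCDEF1", "CJOUR": "ABCDEF2", "DEP": "ABCDEF3", "DIV": "ABCDEF4",
    "DJOUR": "ABCDEF5", "DECBA": "ABCDEF6", "MFEE": "ABCDEF7", "SELL": "ABCDEF8",
    "TEFRA": "ABCDEF9", "WITH": "ABCDEF10",
}

# Fold the dict into one chained when(...) expression. Seeding reduce with
# the functions module F makes the first step call F.when(...); every later
# step calls .when(...) on the Column built so far.
expr = reduce(
    lambda acc, kv: acc.when(F.col("selling_type") == kv[0], F.lit(kv[1])),
    mapping.items(),
    F,
).otherwise(F.col("selling_type"))

df1 = df.withColumn("selling_type", expr)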
CodePudding user response:
.when() does work with strings of arbitrary length. Could you check whether your columns really contain "DEP" without any spaces at the beginning / end etc.?
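One quick way to check, as a sketch assuming the df and column name from the question, is to print every value wrapped in quotes along with its length, which makes hidden padding visible:
from pyspark.sql.functions import col, concat, length, lit

# Quote each raw value and print its length; "DEP" reporting a length
# other than 3 would expose leading/trailing whitespace characters.
df.select(
    concat(lit("'"), col("selling_type"), lit("'")).alias("quoted"),
    length(col("selling_type")).alias("len"),
).show(truncate=False)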
The following code works for me:
>>> from pyspark.sql.types import StringType
>>> df = spark.createDataFrame(["CJOUR", "DEP"], StringType())
>>> df.printSchema()
root
|-- value: string (nullable = true)
>>> df = df.withColumnRenamed("value", "selling_type")
>>> from pyspark.sql.functions import when, col, lit, trim
>>> df1 = df.withColumn("selling_type", trim(col("selling_type"))).withColumn("selling_type", when(col("selling_type") == "BUY", lit("ABCDEF1"))
.when(col("selling_type") == "CJOUR", lit("ABCDEF2"))
.when(col("selling_type") == "DEP", lit("ABCDEF3"))
.when(col("selling_type") == "DIV", lit("ABCDEF4"))
.when(col("selling_type") == "DJOUR", lit("ABCDEF5"))
.when(col("selling_type") == "DECBA", lit("ABCDEF6"))
.when(col("selling_type") == "MFEE", lit("ABCDEF7"))
.when(col("selling_type") == "SELL", lit("ABCDEF8"))
.when(col("selling_type") == "TEFRA", lit("ABCDEF9"))
.when(col("selling_type") == "WITH", lit("ABCDEF10"))
.otherwise(col("selling_type")))
>>>
>>> df1.show()
+------------+
|selling_type|
+------------+
|     ABCDEF2|
|     ABCDEF3|
+------------+