I'm trying to change the column values using a when condition. The when condition works well, and the column gets updated with the corresponding literal values for "DJOUR" and "DECBA", which are both 5 characters long, but the when condition is not working for the other values like "DEP", "BUY", "WITH"..., which are shorter than 5 characters.
Why is the when condition unable to identify the shorter string values?
Also, please correct me if I'm using the when condition incorrectly.
from pyspark.sql.functions import when, col, lit

df1 = df.withColumn("selling_type", when(col("selling_type") == "BUY", lit("ABCDEF1"))
    .when(col("selling_type") == "CJOUR", lit("ABCDEF2"))
    .when(col("selling_type") == "DEP", lit("ABCDEF3"))    # not working
    .when(col("selling_type") == "DIV", lit("ABCDEF4"))    # not working
    .when(col("selling_type") == "DJOUR", lit("ABCDEF5"))  # working
    .when(col("selling_type") == "DECBA", lit("ABCDEF6"))  # working
    .when(col("selling_type") == "MFEE", lit("ABCDEF7"))
    .when(col("selling_type") == "SELL", lit("ABCDEF8"))
    .when(col("selling_type") == "TEFRA", lit("ABCDEF9"))
    .when(col("selling_type") == "WITH", lit("ABCDEF10"))
    .otherwise(col("selling_type")))
CodePudding user response:
Try the following; it will trim all the visible and invisible whitespace from both sides of your strings. Also, you can replace all the repetitive when clauses with a simpler replace method.
Input dataframe:
from pyspark.sql import functions as F
df = spark.createDataFrame([("BUY",), ("CJOUR",), ("DEP",), ("DIV",), ("DJOUR",), ("DECBA",), ("MFEE",), ("SELL",), ("TEFRA",), ("WITH",)], ["selling_type"])
Script:
df = df.withColumn("selling_type", F.regexp_replace("selling_type", r"^\s+|\s+$", ""))
df = df.replace({
    "BUY": "ABCDEF1",
    "CJOUR": "ABCDEF2",
    "DEP": "ABCDEF3",
    "DIV": "ABCDEF4",
    "DJOUR": "ABCDEF5",
    "DECBA": "ABCDEF6",
    "MFEE": "ABCDEF7",
    "SELL": "ABCDEF8",
    "TEFRA": "ABCDEF9",
    "WITH": "ABCDEF10",
}, subset=["selling_type"])
df.show()
# +------------+
# |selling_type|
# +------------+
# |     ABCDEF1|
# |     ABCDEF2|
# |     ABCDEF3|
# |     ABCDEF4|
# |     ABCDEF5|
# |     ABCDEF6|
# |     ABCDEF7|
# |     ABCDEF8|
# |     ABCDEF9|
# |    ABCDEF10|
# +------------+
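If you would rather keep the when semantics instead of replace (for instance, to add more complex conditions later), the repetitive chain can also be built from a dict. Here is a minimal sketch, assuming the same column and mapping as above:
from functools import reduce
from pyspark.sql import functions as F

mapping = {
    "BUY": "ABCDEF1", "CJOUR": "ABCDEF2", "DEP": "ABCDEF3", "DIV": "ABCDEF4",
    "DJOUR": "ABCDEF5", "DECBA": "ABCDEF6", "MFEE": "ABCDEF7", "SELL": "ABCDEF8",
    "TEFRA": "ABCDEF9", "WITH": "ABCDEF10",
}

# Fold the dict into one chained when(...) expression. Seeding reduce with
# the functions module F makes the first step call F.when(...); every later
# step calls .when(...) on the Column built so far.
expr = reduce(
    lambda acc, kv: acc.when(F.col("selling_type") == kv[0], F.lit(kv[1])),
    mapping.items(),
    F,
).otherwise(F.col("selling_type"))

df1 = df.withColumn("selling_type", expr)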
CodePudding user response:
.when() does work with strings of arbitrary length. Could you check whether your columns really contain "DEP" without any spaces at the beginning / end etc.?
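One quick way to check, as a sketch assuming the df and column name from the question, is to print every value wrapped in quotes along with its length, which makes hidden padding visible:
from pyspark.sql.functions import col, concat, length, lit

# Quote each raw value and print its length; "DEP" reporting a length
# other than 3 would expose leading/trailing whitespace characters.
df.select(
    concat(lit("'"), col("selling_type"), lit("'")).alias("quoted"),
    length(col("selling_type")).alias("len"),
).show(truncate=False)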
The following code works for me:
>>> from pyspark.sql.types import StringType
>>> df = spark.createDataFrame(["CJOUR", "DEP"], StringType())
>>> df.printSchema()
root
|-- value: string (nullable = true)
>>> df = df.withColumnRenamed("value", "selling_type")
>>> from pyspark.sql.functions import when, col, lit, trim
>>> df1 = df.withColumn("selling_type", trim(col("selling_type"))).withColumn("selling_type", when(col("selling_type") == "BUY", lit("ABCDEF1"))
.when(col("selling_type") == "CJOUR", lit("ABCDEF2"))
.when(col("selling_type") == "DEP", lit("ABCDEF3"))
.when(col("selling_type") == "DIV", lit("ABCDEF4"))
.when(col("selling_type") == "DJOUR", lit("ABCDEF5"))
.when(col("selling_type") == "DECBA", lit("ABCDEF6"))
.when(col("selling_type") == "MFEE", lit("ABCDEF7"))
.when(col("selling_type") == "SELL", lit("ABCDEF8"))
.when(col("selling_type") == "TEFRA", lit("ABCDEF9"))
.when(col("selling_type") == "WITH", lit("ABCDEF10"))
.otherwise(col("selling_type")))
>>>
>>> df1.show()
+------------+
|selling_type|
+------------+
|     ABCDEF2|
|     ABCDEF3|
+------------+