PySpark: modify 2 column values when these 2 columns simultaneously satisfy a condition

Time:06-18

I have a PySpark DataFrame in which I want to change the values of 2 columns simultaneously, based on a filter condition involving those same 2 columns. I'll give a hypothetical example, as I cannot share the data.

+----+------+
| Id | Rank |
+----+------+
| a  | 5    |
| b  | 7    |
| c  | 8    |
| d  | 1    |
|    | 9    |
+----+------+

Condition: when Id == " " and Rank == 9, set Id = "A1" and Rank = 0; otherwise leave the row unchanged. Thanks!

CodePudding user response:

You can evaluate the condition separately for each column inside a single selectExpr, so both expressions see the original values.

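from pyspark.sql import SparkSession

# Reuse the active session, or create one if none exists
spark = SparkSession.builder.getOrCreate()

data = [
    ('a', 5),
    ('b', 7),
    ('c', 8),
    ('d', 1),
    (' ', 8),
    (' ', 9),
    ('e', 9)
]
df = spark.createDataFrame(data, ['id', 'rank'])
# Both expressions test the original id and rank, so only the (' ', 9) row changes
df = df.selectExpr(
    'if((id = " " and rank = 9), "A1", id) as id',
    'if((id = " " and rank = 9), 0, rank) as rank'
)
df.show(truncate=False)

# +---+----+
# |id |rank|
# +---+----+
# |a  |5   |
# |b  |7   |
# |c  |8   |
# |d  |1   |
# |   |8   |
# |A1 |0   |
# |e  |9   |
# +---+----+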