Replace a column value based on other column values in a PySpark data frame


I have the following Spark data frame.

Date_1         Value     Date_2 
20-10-2021       1        Date 
20-10-2021       2        Date 
21-10-2021       3        Date 
23-10-2021       4        Date 

I would like to fill the Date_2 values by adding (Value - 1) days to Date_1.

The output that I would like to see is the following.

Date_1         Value     Date_2 
20-10-2021       1        20-10-2021
20-10-2021       2        21-10-2021 
21-10-2021       3        23-10-2021
23-10-2021       4        26-10-2021 

I have tried this using pyspark.

import pyspark.sql.functions as F

df = df.withColumn('Date_2', F.date_add(df['Date_1'], df['Value'] - 1)).show()

But I am getting TypeError: Column is not iterable.

Can anyone help with this?

CodePudding user response:

The signature of the function date_add is (col, int), so you cannot pass df['Value'] (a Column) directly as the second argument.

Try this instead (note that .show() returns None, so don't assign its result back to df):

df = df.withColumn('Date_2', F.expr("date_add(Date_1, Value - 1)"))
df.show()
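As a sanity check outside Spark, the same arithmetic can be reproduced with plain Python's datetime module. The date_add helper below is only an illustration of what DATE_ADD(Date_1, Value - 1) computes per row; it is not part of PySpark:

```python
from datetime import datetime, timedelta

# Sample rows from the question: (Date_1, Value)
rows = [("20-10-2021", 1), ("20-10-2021", 2), ("21-10-2021", 3), ("23-10-2021", 4)]

# Illustrative stand-in for Spark's DATE_ADD: add `days` days to a dd-MM-yyyy date string
def date_add(date_str, days):
    d = datetime.strptime(date_str, "%d-%m-%Y") + timedelta(days=days)
    return d.strftime("%d-%m-%Y")

# Compute Date_2 = date_add(Date_1, Value - 1) for every row
date_2 = [date_add(d, v - 1) for d, v in rows]
print(date_2)  # ['20-10-2021', '21-10-2021', '23-10-2021', '26-10-2021']
```

This matches the expected output column in the question, confirming that the offset should be Value - 1 rather than Value.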

CodePudding user response:

You can call the SQL function DATE_ADD through F.expr like this:

(
    df
    .withColumn("Value", F.col("Value").cast("int"))
    .withColumn("Date_2", 
                F.expr('DATE_ADD(Date_1, Value - 1)')
               )
)

DATE_ADD(Date_1, Value - 1) will, for each row, add Value - 1 days to the date in Date_1.

Additionally, the Value column should be of type INT (hence the cast above, if it isn't already). If it were, for example, of type DOUBLE, an AnalysisException would occur.
