Apache spark (pyspark), how to replace a value in a column of a row with another value from same col


Apache spark (pyspark), how to replace a value in a column of a row with another value from same column from a different row

df1.filter(F.col('LAST_NAME') == 'Maltster').withColumn("ANNUAL_HOUSEHOLD_INCOME", df1.filter(F.col('LAST_NAME') == 'Attiwill').select(F.col('ANNUAL_HOUSEHOLD_INCOME'))[0]).show()

I am trying to replace the 'ANNUAL_HOUSEHOLD_INCOME' value in the row with LAST_NAME=Maltster with the 'ANNUAL_HOUSEHOLD_INCOME' value in the row with LAST_NAME=Attiwill.

For example:

Before running the code, the table looks like:
+---------+-----------------------+
|LAST_NAME|ANNUAL_HOUSEHOLD_INCOME|
+---------+-----------------------+
|Maltster |20000                  |
|Attiwill |100000                 |
+---------+-----------------------+

After running the code the table should look like:

+---------+-----------------------+
|LAST_NAME|ANNUAL_HOUSEHOLD_INCOME|
+---------+-----------------------+
|Maltster |100000                 |
|Attiwill |100000                 |
+---------+-----------------------+

But when I run the code above, the value is not overwritten.

CodePudding user response:

I think you should reread the docs; it should be:

df1.withColumn(
    'ANNUAL_HOUSEHOLD_INCOME',
    F.when(F.col('LAST_NAME') == 'Maltster', F.lit(100000))
     .otherwise(F.col('ANNUAL_HOUSEHOLD_INCOME'))
)

CodePudding user response:

Adding to Jonathan's answer, the below code will give the correct output:

df1.withColumn(
    'ANNUAL_HOUSEHOLD_INCOME',
    F.when(
        F.col('LAST_NAME') == 'Maltster',
        df1.filter(F.col('LAST_NAME') == 'Attiwill')
           .select(F.col('ANNUAL_HOUSEHOLD_INCOME'))
           .collect()[0][0]
    ).otherwise(F.col('ANNUAL_HOUSEHOLD_INCOME'))
)