Add columns values in pyspark for the same customerID

Time:09-22

I have two DataFrames: one is a full load and the other is an incremental load. For every customer_ID that appears in both DataFrames, I want to add the amounts together, so that the final output keeps one row per customer_ID with the updated value. I would like to do this purely with DataFrames; I know an upsert would also work, but that requires creating a few staging tables, which I want to avoid. How can I add the incremental data to the main table?

Full Table

customer_ID amount
ABC1234 20
ABC1235 10
ABC2222 20
ABC1345 10
ABC4444 5
ABC5555 9

incremental_load

customer_ID amount
ABC1234 5
ABC1235 5
ABC2222 7
ABC1345 20
ABC9999 15

Desired_output

customer_ID amount
ABC1234 25
ABC1235 15
ABC2222 27
ABC1345 30
ABC4444 5
ABC5555 9
ABC9999 15

CodePudding user response:

You can union both DataFrames and then sum all entries for the same customer (note that union matches columns by position, so both DataFrames must have the same column order):

from pyspark.sql import functions as F

dfFull = ...  # full-load DataFrame
dfInc = ...   # incremental-load DataFrame

dfResult = (
    dfFull.union(dfInc)
    .groupBy("customer_ID")
    .agg(F.sum("amount").alias("amount"))
)