How to create a PySpark DataFrame with the sum of columns from different DataFrames?


I have two tables, DF1 and DF2, and I need to create a DF3 with the same columns (all columns are identical). The value of each column needs to be the sum of the respective columns in the other two DataFrames. Here is a small example; the real DataFrames have 38 columns:

Edit: the id column is the only one that always needs to keep the same value.

DF1

id  val1  val2  val3
1   10    20    30
2   15    10    5

DF2

id  val1  val2  val3
1   1     2     3
2   5     10    15

Expected DF3

id  val1  val2  val3
1   11    22    33
2   20    20    20

I'm a newbie working with PySpark and I have no clue how to do it.
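
For reference, here is a minimal sketch of how the sample tables above could be built (the SparkSession setup is an assumption; DF1 and DF2 match the names used in the question):

from pyspark.sql import SparkSession

# Assumed setup: a local SparkSession (the app name is arbitrary)
spark = SparkSession.builder.appName("sum-columns-example").getOrCreate()

cols = ['id', 'val1', 'val2', 'val3']
DF1 = spark.createDataFrame([(1, 10, 20, 30), (2, 15, 10, 5)], cols)
DF2 = spark.createDataFrame([(1, 1, 2, 3), (2, 5, 10, 15)], cols)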

CodePudding user response:

Union the two DataFrames with unionByName, group by id, and aggregate each value column with sum:

from pyspark.sql import functions as F

# Stack both DataFrames, then sum each value column per id
(DF1.unionByName(DF2)
    .groupby('id')
    .agg(*[F.sum(c).alias(c) for c in ['val1', 'val2', 'val3']])
    .show())
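
Since the real DataFrames have 38 columns, hard-coding the list doesn't scale. A small variation derives the value columns from the schema instead (assuming id is the only grouping column and both DataFrames share the same schema):

# Every column except the grouping key gets summed
value_cols = [c for c in DF1.columns if c != 'id']

DF3 = (DF1.unionByName(DF2)
          .groupby('id')
          .agg(*[F.sum(c).alias(c) for c in value_cols]))
DF3.show()

Note that unionByName matches columns by name rather than by position, so it is safer than union when the two DataFrames might declare their columns in a different order.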