I have two dataframes that both contain a customer id column. Aside from that, they have different columns. I'm trying to combine them into one dataframe that merges rows with matching customer ids and keeps the remaining columns from both dataframes. Currently I have:
DF1:

| Cust_id | 7/13/22 Data |
|---------|--------------|
| 0001    | 1.423        |
| 0002    | 1.664        |
| 0003    | 2.451        |
| 0004    | 1.412        |
DF2:

| Cust_id | 7/17/22 Data |
|---------|--------------|
| 0003    | 1.345        |
| 0004    | 1.456        |
| 0005    | 2.111        |
| 0006    | 1.409        |
I need 0003 to show the data for both 7/13 and 7/17, while 0001 would show the 7/13 data and null for 7/17. I've tried unionByName, but it just stacks the rows; I need the matching ids combined into one row. Any help would be greatly appreciated!
CodePudding user response:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()

# Build the two sample dataframes
d1 = [['0001', 1.423], ['0002', 1.664], ['0003', 2.451], ['0004', 1.412]]
df1 = spark.createDataFrame(d1, ['Cust_id', '7/13/22 Data'])
d2 = [['0003', 1.345], ['0004', 1.456], ['0005', 2.111], ['0006', 1.409]]
df2 = spark.createDataFrame(d2, ['Cust_id', '7/17/22 Data'])

# A full outer join on Cust_id keeps every id from both sides,
# with null wherever a customer is missing from one dataframe
df1.join(df2, ['Cust_id'], "outer").sort('Cust_id').show()
```
```
+-------+------------+------------+
|Cust_id|7/13/22 Data|7/17/22 Data|
+-------+------------+------------+
|   0001|       1.423|        null|
|   0002|       1.664|        null|
|   0003|       2.451|       1.345|
|   0004|       1.412|       1.456|
|   0005|        null|       2.111|
|   0006|        null|       1.409|
+-------+------------+------------+
```
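If you'd rather show a number than null for the missing dates, `DataFrame.fillna` can replace the nulls after the join. A minimal sketch; the 0.0 default is an assumption about what you want to display, not something from the question:

```python
# Replace nulls produced by the outer join with a placeholder value
# (0.0 is an assumed sentinel; pick whatever suits your report)
df_filled = df1.join(df2, ['Cust_id'], "outer").fillna(0.0).sort('Cust_id')
df_filled.show()
```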
CodePudding user response:
The solution does not require any union; you can handle it with an outer join.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()

d1 = [['0001', 1.423], ['0002', 1.664], ['0003', 2.451], ['0004', 1.412]]
df1 = spark.createDataFrame(d1, ['Cust_id', '7/13/22 Data'])
d2 = [['0003', 1.345], ['0004', 1.456], ['0005', 2.111], ['0006', 1.409]]
df2 = spark.createDataFrame(d2, ['Cust_id', '7/17/22 Data'])

# Outer join merges rows with matching Cust_id and nulls out the rest
df_res = df1.join(df2, ['Cust_id'], "outer")
df_res.show()
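```

That said, if you do want to stay close to unionByName, you can get the same result by unioning with `allowMissingColumns=True` (available in Spark 3.1+) and then collapsing the duplicate ids. A sketch under that assumption, reusing df1 and df2 from above:

```python
from pyspark.sql import functions as F

# Stack the rows, padding each side's missing column with null (Spark 3.1+)
stacked = df1.unionByName(df2, allowMissingColumns=True)

# Collapse the rows per Cust_id, keeping the non-null value in each column
merged = stacked.groupBy('Cust_id').agg(
    F.first('7/13/22 Data', ignorenulls=True).alias('7/13/22 Data'),
    F.first('7/17/22 Data', ignorenulls=True).alias('7/17/22 Data'),
)
merged.sort('Cust_id').show()
```

The outer join is simpler here, but the union-then-aggregate route generalizes when you have many dated dataframes to merge in one pass.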