Combining DataFrames with Spark

Time: 07-20

I have two dataframes that both contain a customer id column; aside from that, their columns differ. I'm trying to combine them into one dataframe that matches rows on customer id and keeps the remaining columns from both. Currently I have:

DF1:

Cust_id  7/13/22 Data
0001     1.423
0002     1.664
0003     2.451
0004     1.412

DF2:

Cust_id  7/17/22 Data
0003     1.345
0004     1.456
0005     2.111
0006     1.409

I need 0003 to show the data for 7/13 and 7/17, while 0001 would show the 7/13 data and null for 7/17. I've used unionByName, but I need the ids combined. Any help would be greatly appreciated!

CodePudding user response:

d1 =[['0001', 1.423],['0002', 1.664],['0003', 2.451],['0004', 1.412]]
df1 = spark.createDataFrame(d1, ['Cust_id', '7/13/22 Data'])

d2 =[['0003', 1.345],['0004', 1.456],['0005', 2.111],['0006', 1.409]]
df2 = spark.createDataFrame(d2, ['Cust_id', '7/17/22 Data'])

df1.join(df2, ['Cust_id'], "outer").sort('Cust_id').show()
+-------+------------+------------+
|Cust_id|7/13/22 Data|7/17/22 Data|
+-------+------------+------------+
|   0001|       1.423|        null|
|   0002|       1.664|        null|
|   0003|       2.451|       1.345|
|   0004|       1.412|       1.456|
|   0005|        null|       2.111|
|   0006|        null|       1.409|
+-------+------------+------------+

CodePudding user response:

The solution does not require any union; you can handle it with an outer join.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()

d1 =[['0001', 1.423],['0002', 1.664],['0003', 2.451],['0004', 1.412]]
df1 = spark.createDataFrame(d1, ['Cust_id', '7/13/22 Data'])

d2 =[['0003', 1.345],['0004', 1.456],['0005', 2.111],['0006', 1.409]]
df2 = spark.createDataFrame(d2, ['Cust_id', '7/17/22 Data'])


df_res = df1.join(df2,['Cust_id'],"outer")

df_res.show()