PySpark - Merge DataFrames


I have two DataFrames that share the same key, and I would like to merge them into a single DataFrame while identifying which DataFrame each column came from. Is this possible?

df1
+--------------------------------------+--------------------+
|                                   ID | CURRENCY           |
+--------------------------------------+--------------------+
| 401148EE-9BA6-4BAA-B113-ED694B0F5BED | 100.00             |
| E90ED21E-C60F-412C-8305-DB5675DA7A5E | 1000.00            |
+--------------------------------------+--------------------+
df2
+--------------------------------------+--------------------+
|                                   ID | CURRENCY           |
+--------------------------------------+--------------------+
| 401148EE-9BA6-4BAA-B113-ED694B0F5BED | 200.00             |
| E90ED21E-C60F-412C-8305-DB5675DA7A5E | 2000.00            |
+--------------------------------------+--------------------+
Result
+--------------------------------------+--------------------+--------------------+
|                                   ID | DF1.CURRENCY       | DF2.CURRENCY       |
+--------------------------------------+--------------------+--------------------+
| 401148EE-9BA6-4BAA-B113-ED694B0F5BED | 100.00             | 200.00             |
| E90ED21E-C60F-412C-8305-DB5675DA7A5E | 1000.00            | 2000.00            |
+--------------------------------------+--------------------+--------------------+

CodePudding user response:

Use a join for this case.

Example:

df1 = spark.createDataFrame([('401148EE-9BA6-4BAA-B113-ED694B0F5BED', 100),
                             ('E90ED21E-C60F-412C-8305-DB5675DA7A5E', 1000)],
                            ['id', 'currency']).withColumnRenamed("currency", "df1.currency")

df2 = spark.createDataFrame([('401148EE-9BA6-4BAA-B113-ED694B0F5BED', 200),
                             ('E90ED21E-C60F-412C-8305-DB5675DA7A5E', 2000)],
                            ['id', 'currency']).withColumnRenamed("currency", "df2.currency")

df1.join(df2,['id'],'inner').show()
#+--------------------+------------+------------+
#|                  id|df1.currency|df2.currency|
#+--------------------+------------+------------+
#|401148EE-9BA6-4BA...|         100|         200|
#|E90ED21E-C60F-412...|        1000|        2000|
#+--------------------+------------+------------+
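
One caveat worth adding here (my note, not part of the original answer): a column name that contains a dot, like df1.currency, has to be backtick-escaped when you reference it later, because Spark otherwise parses the dot as a struct-field accessor:

joined = df1.join(df2, ['id'], 'inner')
# without backticks Spark would look for a field named `currency`
# inside a struct column named `df1`
joined.select('id', '`df1.currency`', '`df2.currency`').show()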

CodePudding user response:

df1 = spark.createDataFrame([('401148EE-9BA6-4BAA-B113-ED694B0F5BED', 100.00),
                             ('E90ED21E-C60F-412C-8305-DB5675DA7A5E', 1000.00)],
                            ['ID', 'CURRENCY'])
df2 = spark.createDataFrame([('401148EE-9BA6-4BAA-B113-ED694B0F5BED', 200.00),
                             ('E90ED21E-C60F-412C-8305-DB5675DA7A5E', 2000.00)],
                            ['ID', 'CURRENCY'])


df1\
  .withColumnRenamed("CURRENCY", "DF1.CURRENCY")\
  .join(df2.withColumnRenamed("CURRENCY", "DF2.CURRENCY"),['ID'],how='full')\
  .show(truncate=False)

+------------------------------------+------------+------------+
|ID                                  |DF1.CURRENCY|DF2.CURRENCY|
+------------------------------------+------------+------------+
|401148EE-9BA6-4BAA-B113-ED694B0F5BED|100.0       |200.0       |
|E90ED21E-C60F-412C-8305-DB5675DA7A5E|1000.0      |2000.0      |
+------------------------------------+------------+------------+
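
A variant worth mentioning (a sketch of an alternative, not from the original answer): you can skip the upfront renames by aliasing each DataFrame and qualifying the columns after the join. The alias names "a" and "b" below are arbitrary:

import pyspark.sql.functions as F

# alias each side, join on the shared key, then pull each CURRENCY
# column from its originating DataFrame and label it accordingly
result = (df1.alias("a")
             .join(df2.alias("b"), ['ID'], how='full')
             .select('ID',
                     F.col('a.CURRENCY').alias('DF1.CURRENCY'),
                     F.col('b.CURRENCY').alias('DF2.CURRENCY')))
result.show(truncate=False)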