I have two DataFrames that share the same key, and I would like to unify them into a single DataFrame while identifying which DataFrame each column came from. Is this possible?
df1
+--------------------------------------+--------------------+
| ID                                   | CURRENCY           |
+--------------------------------------+--------------------+
| 401148EE-9BA6-4BAA-B113-ED694B0F5BED | 100.00             |
| E90ED21E-C60F-412C-8305-DB5675DA7A5E | 1000.00            |
+--------------------------------------+--------------------+
df2
+--------------------------------------+--------------------+
| ID                                   | CURRENCY           |
+--------------------------------------+--------------------+
| 401148EE-9BA6-4BAA-B113-ED694B0F5BED | 200.00             |
| E90ED21E-C60F-412C-8305-DB5675DA7A5E | 2000.00            |
+--------------------------------------+--------------------+
Result
+--------------------------------------+--------------------+--------------------+
| ID                                   | DF1.CURRENCY       | DF2.CURRENCY       |
+--------------------------------------+--------------------+--------------------+
| 401148EE-9BA6-4BAA-B113-ED694B0F5BED | 100.00             | 200.00             |
| E90ED21E-C60F-412C-8305-DB5675DA7A5E | 1000.00            | 2000.00            |
+--------------------------------------+--------------------+--------------------+
CodePudding user response:
Use a join for this case.
Example:
df1 = spark.createDataFrame(
    [('401148EE-9BA6-4BAA-B113-ED694B0F5BED', 100),
     ('E90ED21E-C60F-412C-8305-DB5675DA7A5E', 1000)],
    ['id', 'currency']).withColumnRenamed("currency", "df1.currency")
df2 = spark.createDataFrame(
    [('401148EE-9BA6-4BAA-B113-ED694B0F5BED', 200),
     ('E90ED21E-C60F-412C-8305-DB5675DA7A5E', 2000)],
    ['id', 'currency']).withColumnRenamed("currency", "df2.currency")
df1.join(df2,['id'],'inner').show()
# +--------------------+------------+------------+
# |                  id|df1.currency|df2.currency|
# +--------------------+------------+------------+
# |401148EE-9BA6-4BA...|         100|         200|
# |E90ED21E-C60F-412...|        1000|        2000|
# +--------------------+------------+------------+
CodePudding user response:
df1 = spark.createDataFrame([('401148EE-9BA6-4BAA-B113-ED694B0F5BED', 100.00),('E90ED21E-C60F-412C-8305-DB5675DA7A5E',1000.00 )],['ID','CURRENCY'])
df2 = spark.createDataFrame([('401148EE-9BA6-4BAA-B113-ED694B0F5BED', 200.00),('E90ED21E-C60F-412C-8305-DB5675DA7A5E',2000.00 )],['ID','CURRENCY'])
df1\
.withColumnRenamed("CURRENCY", "DF1.CURRENCY")\
.join(df2.withColumnRenamed("CURRENCY", "DF2.CURRENCY"),['ID'],how='full')\
.show(truncate=False)
+------------------------------------+------------+------------+
|ID                                  |DF1.CURRENCY|DF2.CURRENCY|
+------------------------------------+------------+------------+
|401148EE-9BA6-4BAA-B113-ED694B0F5BED|100.0       |200.0       |
|E90ED21E-C60F-412C-8305-DB5675DA7A5E|1000.0      |2000.0      |
+------------------------------------+------------+------------+