Home > database >  Pyspark - Union two data frames with same column based n same id
Pyspark - Union two data frames with same column based n same id

Time:06-02

DF1

Id Name Desc etc
A Name1 desc1 etc1
B name2 desc2 etc2

DF2

Id Name Desc etc
A Name2 desc2 etc2
C name2 desc2 etc2

I want to union records from DF2 into DF1 where the ID is equal and include all records from DF1.

Result DF

Id Name Desc etc
A Name1 desc1 etc1
B name2 desc2 etc2
A name2 desc2 etc2

What is the best way to do it? Any help Is appreciated.

CodePudding user response:

You can do so by doing a semi join keeping only the ids in df2 which exists in df1, followed by a union with df1.

d1 = [
    ('A', 'Name1', 'desc1', 'etc1'),
    ('B', 'name2', 'desc2', 'etc2'),
  ]

d2 = [
    ('A', 'Name2', 'desc2', 'etc2'),
    ('C', 'name2', 'desc2', 'etc2'),
]

df1 = spark.createDataFrame(d1, ['Id', 'Name', 'Desc', 'etc'])
df2 = spark.createDataFrame(d2, ['Id', 'Name', 'Desc', 'etc'])

df2.join(df1, on='Id', how='semi').union(df1).show()

 --- ----- ----- ---- 
| Id| Name| Desc| etc|
 --- ----- ----- ---- 
|  A|Name2|desc2|etc2|
|  A|Name1|desc1|etc1|
|  B|name2|desc2|etc2|
 --- ----- ----- ---- 
  • Related