I have multiple tables with the same columns:
table1:
==========
|id | a |
==========
|1 | aa |
|2 | bb |
|3 | cc |
|4 | dd |
==========
table2:
==========
|id | a |
==========
|1 | aa |
|2 | bb |
|3 | cc |
|4 | dd |
==========
table3:
==========
|id | a |
==========
|1 | aaa |
|2 | bbb |
|3 | ccc |
==========
How do I merge the three tables so that the output contains every id that appears in any of them? For each id, the value from the last table (table3) should win; if an id is missing from table3 (like id 4), use the value from the most recent table where that id appeared (table2 in this case).
Final output should be like below:
============
|id | a |
============
|1 | aaa |
|2 | bbb |
|3 | ccc |
|4 | dd |
============
I was thinking of a full join or a union, but I'm not sure how to format the output table to look like the above.
finaloutput = \
table3 \
.join(table2, (table3.id == table2.id), "full") \
.join(table1, (table3.id == table1.id), "full")
CodePudding user response:
I think you are looking for the coalesce function, which returns the first non-null value among its arguments. In your case, it would go like this:
import pyspark.sql.functions as F
# take table3.a if it exists, otherwise take table2.a if it exists
# and otherwise take table1.a
table3\
.join(table2, ['id'], "full")\
.join(table1, ['id'], "full")\
.select("id", F.coalesce(table3.a, table2.a, table1.a).alias("a"))\
.show()
+---+---+
| id|  a|
+---+---+
|  1|aaa|
|  2|bbb|
|  3|ccc|
|  4| dd|
+---+---+
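As a side note, the last-table-wins logic that coalesce implements here can be sketched without Spark at all. This is just a toy illustration of the merge semantics using plain Python dicts (not part of the Spark solution): merging later tables over earlier ones overwrites existing ids, while ids absent from later tables keep their most recent value.

```python
# Each "table" is modeled as a dict of id -> a.
table1 = {1: "aa", 2: "bb", 3: "cc", 4: "dd"}
table2 = {1: "aa", 2: "bb", 3: "cc", 4: "dd"}
table3 = {1: "aaa", 2: "bbb", 3: "ccc"}

# Later dicts overwrite earlier ones, so table3 wins where it has an id;
# id 4 is missing from table3, so the value from table2 survives.
merged = {**table1, **table2, **table3}
print(merged)  # {1: 'aaa', 2: 'bbb', 3: 'ccc', 4: 'dd'}
```

This mirrors what the full joins plus coalesce do in Spark, just in memory and without null handling.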