I have multiple tables with the same columns:
table1:
==========
|id | a |
==========
|1 | aa |
|2 | bb |
|3 | cc |
|4 | dd |
==========
table2:
==========
|id | a |
==========
|1 | aa |
|2 | bb |
|3 | cc |
|4 | dd |
==========
table3:
==========
|id | a |
==========
|1 | aaa |
|2 | bbb |
|3 | ccc |
==========
How do I merge the three tables so that the output contains every id that appears in any of them? For each id, the value from the last table (table3) should win; if an id is missing from table3 (like id 4), use the value from the most recent table where that id appeared (table2 in this case).
Final output should be like below:
============
|id | a |
============
|1 | aaa |
|2 | bbb |
|3 | ccc |
|4 | dd |
============
I was thinking of a full join or a union, but I'm not sure how to format the output table to look like the above.
finaloutput = \
table3 \
.join(table2, (table3.id == table2.id), "full") \
.join(table1, (table3.id == table1.id), "full")
CodePudding user response:
I think you are looking for the coalesce function, which returns the first non-null value among its arguments. In your case, it would go like this:
import pyspark.sql.functions as F
# take table3.a if it exists, otherwise take table2.a if it exists
# and otherwise take table1.a
table3\
.join(table2, ['id'], "full")\
.join(table1, ['id'], "full")\
.select("id", F.coalesce(table3.a, table2.a, table1.a).alias("a"))\
.show()
+---+---+
| id|  a|
+---+---+
|  1|aaa|
|  2|bbb|
|  3|ccc|
|  4| dd|
+---+---+
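As a side note, the last-table-wins logic that coalesce implements here can be sketched without Spark at all. This is just a toy illustration of the merge semantics using plain Python dicts (not part of the Spark solution): merging later tables over earlier ones overwrites existing ids, while ids absent from later tables keep their most recent value.

```python
# Each "table" is modeled as a dict of id -> a.
table1 = {1: "aa", 2: "bb", 3: "cc", 4: "dd"}
table2 = {1: "aa", 2: "bb", 3: "cc", 4: "dd"}
table3 = {1: "aaa", 2: "bbb", 3: "ccc"}

# Later dicts overwrite earlier ones, so table3 wins where it has an id;
# id 4 is missing from table3, so the value from table2 survives.
merged = {**table1, **table2, **table3}
print(merged)  # {1: 'aaa', 2: 'bbb', 3: 'ccc', 4: 'dd'}
```

This mirrors what the full joins plus coalesce do in Spark, just in memory and without null handling.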