I have two PySpark dataframes that share some key IDs but have different values in the other columns. What I want to achieve is to merge one dataframe into the other.
First Dataframe:
ID1 | ID2 | DATE | VAL1 | VAL2 |
---|---|---|---|---|
19 | 22 | 05-03-2012 | 311 | 622 |
20 | 30 | 05-03-2012 | 40 | 60 |
20 | 30 | 06-03-2012 | 70 | 120 |
20 | 30 | 07-03-2012 | 480 | 3 |
20 | 30 | 08-03-2012 | 49 | 98 |
Second Dataframe:
ID1 | ID2 | DATE | VAL1 | VAL2 |
---|---|---|---|---|
19 | 22 | 07-03-2012 | 311 | 622 |
20 | 30 | 06-03-2012 | 22 | 2 |
Final DF:
ID1 | ID2 | DATE | VAL1 | VAL2 |
---|---|---|---|---|
19 | 22 | 05-03-2012 | 311 | 622 |
19 | 22 | 07-03-2012 | 311 | 622 |
20 | 30 | 05-03-2012 | 40 | 60 |
20 | 30 | 06-03-2012 | 70 | 120 |
20 | 30 | 07-03-2012 | 480 | 3 |
20 | 30 | 08-03-2012 | 49 | 98 |
As you can see, every row absent from one of the dataframes is present in the final dataframe, and rows sharing the same ID1, ID2 and DATE are taken from the first dataframe. These are simplified examples; the real dataframes are much more complicated, with different columns (I'll select the important ones) and hundreds of thousands of rows.
I was experimenting with an outer join, but after many tries I've lost hope, so I'd be grateful for any help.
CodePudding user response:
This should work.
Essentially: first do a left_anti
join to extract only those rows that are absent from the first dataframe but present in the second dataframe, then union them (i.e. append them) to the first dataframe:
Seq<String> colList = convertListToSeq(Stream.of("id1", "id2", "date").collect(Collectors.toList()));
// Rows whose (id1, id2, date) key is present only in the right dataframe
Dataset<Row> missingInLeft = rightDF.join(leftDF, colList, "left_anti");
leftDF.union(missingInLeft).show(); // leftDF plus the rows it was missing
Update: PySpark code:
leftDF.union(rightDF.join(leftDF, ["id1", "id2", "date"], how='left_anti')).show()
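To see why this produces the final dataframe above, here is a plain-Python sketch of the same left_anti-then-union logic, run on the sample data from the question (not PySpark itself, just an illustration of the join semantics; the variable names `left` and `right` stand in for `leftDF` and `rightDF`):

```python
# Sample rows as (ID1, ID2, DATE, VAL1, VAL2) tuples, taken from the question.
left = [
    (19, 22, "05-03-2012", 311, 622),
    (20, 30, "05-03-2012", 40, 60),
    (20, 30, "06-03-2012", 70, 120),
    (20, 30, "07-03-2012", 480, 3),
    (20, 30, "08-03-2012", 49, 98),
]
right = [
    (19, 22, "07-03-2012", 311, 622),
    (20, 30, "06-03-2012", 22, 2),
]

# Join key is (ID1, ID2, DATE), mirroring join(..., ["id1", "id2", "date"], "left_anti").
left_keys = {row[:3] for row in left}

# left_anti: keep only the rows of `right` whose key does not appear in `left`.
missing_in_left = [row for row in right if row[:3] not in left_keys]

# union: append those rows to `left`. On key collisions, the left row survives
# because the colliding right row was filtered out by the anti-join.
final = left + missing_in_left
for row in sorted(final):
    print(row)
```

The row `(20, 30, "06-03-2012", 22, 2)` is dropped because its key already exists in `left`, while `(19, 22, "07-03-2012", 311, 622)` survives, matching the final DF in the question.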