I have two PySpark dataframes that share some key IDs but have different values in the other columns. What I want to achieve is to merge one dataframe into the other.
First Dataframe:
ID1 | ID2 | DATE | VAL1 | VAL2 |
---|---|---|---|---|
19 | 22 | 05-03-2012 | 311 | 622 |
20 | 30 | 05-03-2012 | 40 | 60 |
20 | 30 | 06-03-2012 | 70 | 120 |
20 | 30 | 07-03-2012 | 480 | 3 |
20 | 30 | 08-03-2012 | 49 | 98 |
Second Dataframe:
ID1 | ID2 | DATE | VAL1 | VAL2 |
---|---|---|---|---|
19 | 22 | 07-03-2012 | 311 | 622 |
20 | 30 | 06-03-2012 | 22 | 2 |
Final DF:
ID1 | ID2 | DATE | VAL1 | VAL2 |
---|---|---|---|---|
19 | 22 | 05-03-2012 | 311 | 622 |
19 | 22 | 07-03-2012 | 311 | 622 |
20 | 30 | 05-03-2012 | 40 | 60 |
20 | 30 | 06-03-2012 | 70 | 120 |
20 | 30 | 07-03-2012 | 480 | 3 |
20 | 30 | 08-03-2012 | 49 | 98 |
As you can see, every row absent from one of the dataframes is present in the final dataframe, and rows sharing the same ID1, ID2 and DATE are taken from the first dataframe. These are simplified examples; the real dataframes are much more complicated, with different columns (I'll select the important ones) and hundreds of thousands of rows.
I was experimenting with an outer join, but after many tries I've lost hope, so I'd be grateful for any help.
CodePudding user response:
This should work.
Essentially: first do a left_anti
join to extract only those rows that are absent from the first dataframe but present in the second dataframe, then union them (i.e. append them) to the first dataframe:
Seq<String> colList = convertListToSeq(Stream.of("id1", "id2", "date").collect(Collectors.toList()));
// Rows whose (id1, id2, date) key is present only in the right dataframe
Dataset<Row> missingInLeft = rightDF.join(leftDF, colList, "left_anti");
leftDF.union(missingInLeft).show(); // leftDF plus the rows it was missing
Update: PySpark code:
leftDF.union(rightDF.join(leftDF, ["id1", "id2", "date"], how='left_anti')).show()
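To see why this produces the final dataframe above, here is a plain-Python sketch of the same left_anti-then-union logic, run on the sample data from the question (not PySpark itself, just an illustration of the join semantics; the variable names `left` and `right` stand in for `leftDF` and `rightDF`):

```python
# Sample rows as (ID1, ID2, DATE, VAL1, VAL2) tuples, taken from the question.
left = [
    (19, 22, "05-03-2012", 311, 622),
    (20, 30, "05-03-2012", 40, 60),
    (20, 30, "06-03-2012", 70, 120),
    (20, 30, "07-03-2012", 480, 3),
    (20, 30, "08-03-2012", 49, 98),
]
right = [
    (19, 22, "07-03-2012", 311, 622),
    (20, 30, "06-03-2012", 22, 2),
]

# Join key is (ID1, ID2, DATE), mirroring join(..., ["id1", "id2", "date"], "left_anti").
left_keys = {row[:3] for row in left}

# left_anti: keep only the rows of `right` whose key does not appear in `left`.
missing_in_left = [row for row in right if row[:3] not in left_keys]

# union: append those rows to `left`. On key collisions, the left row survives
# because the colliding right row was filtered out by the anti-join.
final = left + missing_in_left
for row in sorted(final):
    print(row)
```

The row `(20, 30, "06-03-2012", 22, 2)` is dropped because its key already exists in `left`, while `(19, 22, "07-03-2012", 311, 622)` survives, matching the final DF in the question.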