Insert one pyspark dataframe to another with replacement some rows


I have two PySpark dataframes that share some key IDs but differ in their other values. What I want to achieve is to merge one dataframe into the other.

First Dataframe:

ID1 ID2 DATE VAL1 VAL2
19 22 05-03-2012 311 622
20 30 05-03-2012 40 60
20 30 06-03-2012 70 120
20 30 07-03-2012 480 3
20 30 08-03-2012 49 98

Second Dataframe:

ID1 ID2 DATE VAL1 VAL2
19 22 07-03-2012 311 622
20 30 06-03-2012 22 2

Final DF:

ID1 ID2 DATE VAL1 VAL2
19 22 05-03-2012 311 622
19 22 07-03-2012 311 622
20 30 05-03-2012 40 60
20 30 06-03-2012 70 120
20 30 07-03-2012 480 3
20 30 08-03-2012 49 98

As you can see, every row that is absent from one of the dataframes is present in the final dataframe, and rows with the same ID1, ID2, DATE are taken from the first dataframe. These are simplified examples; the real dataframes are much more complicated, with different columns (I'll select the important ones) and hundreds of thousands of rows.

I was experimenting with an outer join, but after many tries I've lost hope, so I'd be grateful for any help.
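
For reference, a minimal sketch to reproduce the two example dataframes; the SparkSession setup and the variable names leftDF/rightDF (matching the answer below) are assumptions for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    cols = ["ID1", "ID2", "DATE", "VAL1", "VAL2"]

    # First dataframe
    leftDF = spark.createDataFrame([
        (19, 22, "05-03-2012", 311, 622),
        (20, 30, "05-03-2012", 40, 60),
        (20, 30, "06-03-2012", 70, 120),
        (20, 30, "07-03-2012", 480, 3),
        (20, 30, "08-03-2012", 49, 98),
    ], cols)

    # Second dataframe
    rightDF = spark.createDataFrame([
        (19, 22, "07-03-2012", 311, 622),
        (20, 30, "06-03-2012", 22, 2),
    ], cols)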

CodePudding user response:

This should work -

Essentially, first do a left_anti join to extract only those rows that are present in the second dataframe but absent from the first dataframe, then union them (i.e. append them) to the first dataframe.

    // convertListToSeq is a user-defined helper that converts a Java List<String> into a Scala Seq<String>
    Seq<String> colList = convertListToSeq(Stream.of("id1", "id2", "date").collect(Collectors.toList()));

    // Rows whose (id1, id2, date) key is only present in the right dataframe
    Dataset<Row> missingInLeft = rightDF.join(leftDF, colList, "left_anti");

    leftDF.union(missingInLeft).show(); // Left + rows missing from left

Update: PySpark code:

    leftDF.union(rightDF.join(leftDF, ["id1", "id2", "date"], how='left_anti')).show()
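
For completeness, a hedged end-to-end sketch based on the answer above, using the leftDF/rightDF dataframes sketched in the question; the final orderBy is only an assumption to match the displayed "Final DF" (lexicographic ordering happens to work for these sample dates, since DATE is a string):

    # Rows whose (ID1, ID2, DATE) key exists only in the second dataframe
    missing_in_left = rightDF.join(leftDF, ["ID1", "ID2", "DATE"], how="left_anti")

    # Keep every row of the first dataframe and append the rows it was missing;
    # union resolves columns by position, and both dataframes share the same schema
    finalDF = leftDF.union(missing_in_left)

    # Sort only to match the expected output layout
    finalDF.orderBy("ID1", "ID2", "DATE").show()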