How to union two dataframes but giving preference to one if it has data in the same month pyspark


I have two dataframes, one with an estimated daily value, and another with the closed value for the month.

I need to show the estimated daily value ONLY when the closed value for the month does not exist.

Example:

df1:

DATA        ID   VALUE  DSC
2022-01-31  123  10     CLOSED MONTH
2022-02-28  123  20     CLOSED MONTH
2022-03-31  999  30     CLOSED MONTH
2022-04-30  999  40     CLOSED MONTH

df2:

DATA        ID   VALUE  DSC
2022-01-31  123  50     ESTIMATED DAY
2022-02-28  123  60     ESTIMATED DAY
2022-03-31  123  70     ESTIMATED DAY
2022-04-30  123  80     ESTIMATED DAY
2022-03-20  123  90     ESTIMATED DAY
2022-03-25  123  100    ESTIMATED DAY
2022-04-30  999  120    ESTIMATED DAY
2022-05-02  999  150    ESTIMATED DAY
2022-05-03  999  200    ESTIMATED DAY

EXPECTED OUTPUT:

DATA        ID   VALUE  DSC
2022-01-31  123  10     CLOSED MONTH
2022-02-28  123  20     CLOSED MONTH
2022-03-31  999  30     CLOSED MONTH
2022-04-30  999  40     CLOSED MONTH
2022-03-20  123  90     ESTIMATED DAY   -- because closed month 3 has a different ID
2022-03-25  123  100    ESTIMATED DAY   -- because closed month 3 has a different ID
2022-05-02  999  150    ESTIMATED DAY   -- because there is no closed month 5
2022-05-03  999  200    ESTIMATED DAY   -- because there is no closed month 5

Does anyone know a solution?

I tried using the window functions row_number, rank, and dense_rank, but it didn't work.

CodePudding user response:

Create another column containing the month's closing date in both df1 and df2.

You can then use the isin function (or a left anti join) to filter out the df2 rows whose (closing date, ID) pair already appears in df1.

Then, simply union your two tables.
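
A minimal sketch of that approach, assuming the column names from the question; last_day derives the closing date, and a left anti join stands in for the isin filter since the match is on a (closing date, ID) pair:

import pyspark.sql.functions as F

# Tag every row with the last day of its month, in both dataframes.
df1_m = df1.withColumn("CLOSING", F.last_day("DATA"))
df2_m = df2.withColumn("CLOSING", F.last_day("DATA"))

# Keep only the estimated rows whose (CLOSING, ID) pair has no
# closed-month counterpart in df1, then union and drop the helper column.
estimates = df2_m.join(df1_m, ["CLOSING", "ID"], "left_anti")
result = df1_m.unionByName(estimates).drop("CLOSING")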

CodePudding user response:

Try using a join instead of a union:

import pyspark.sql.functions as F

# Full outer join on the exact date and ID; coalesce prefers the
# closed-month value and description from df1 whenever both sides match.
result = df1.join(df2, ["DATA", "ID"], "outer").select(
    "DATA",
    "ID",
    F.coalesce(df1.VALUE, df2.VALUE).alias("VALUE"),
    F.coalesce(df1.DSC, df2.DSC).alias("DSC"),
)

Does that produce the result that you expect?
