How to union two dataframes but giving preference to one if it has data in the same month pyspark-CodePudding

I have two dataframes, one with an estimated daily value, and another with the closed value for the month.

I need to show the estimated daily value ONLY when the closed value for the month does not exist.

Example:

df1:

DATA	ID	VALUE	DSC
2022-01-31	123	10	CLOSED MONTH
2022-02-31	123	20	CLOSED MONTH
2022-03-31	999	30	CLOSED MONTH
2022-04-31	999	40	CLOSED MONTH

df2:

DATA	ID	VALUE	DSC
2022-01-31	123	50	ESTIMATED DAY
2022-02-31	123	60	ESTIMATED DAY
2022-03-31	123	70	ESTIMATED DAY
2022-04-31	123	80	ESTIMATED DAY
2022-03-20	123	90	ESTIMATED DAY
2022-03-25	123	100	ESTIMATED DAY
2022-04-30	999	120	ESTIMATED DAY
2022-05-02	999	150	ESTIMATED DAY
2022-05-03	999	200	ESTIMATED DAY

EXPECTED OUTPUT:

DATA	ID	VALUE	DSC
2022-01-31	123	10	CLOSED MONTH
2022-02-31	123	20	CLOSED MONTH
2022-03-31	999	30	CLOSED MONTH
2022-04-31	999	40	CLOSED MONTH
2022-03-20	123	90	ESTIMATED DAY -Because closed month 3 has different ID
2022-03-25	123	100	ESTIMATED DAY -Because closed month 3 has different ID
2022-05-02	999	150	ESTIMATED DAY -Because there is no closed month 5
2022-05-03	999	200	ESTIMATED DAY -Because there is no closed month 5

Does anyone know a solution?

I tried using window function Row_number, rank and dense_rank, but it didn't work.

CodePudding user response：

create another column which contains the closing date in both df1 and df2.

You can use the isIn function to filter out df2 lines based on df1[df1[closing_date, ID]]

Then, simply concat your 2 tables

CodePudding user response：

Try using a join instead of a union

import pyspark.sql.functions as F

result = df1.join(df2, ["DATA", "ID"], "outer").select(
    "DATA",
    "ID",
    F.coalesce(df1.VALUE, df2.VALUE).alias("VALUE"),
    F.coalesce(df1.DSC, df2.DSC).alias("DSC"),
)

Does that produce the result that you expect?