I have two dataframes, one with an estimated daily value, and another with the closed value for the month.
I need to show the estimated daily value ONLY when the closed value for the month does not exist.
Example:
df1:
DATA | ID | VALUE | DSC |
---|---|---|---|
2022-01-31 | 123 | 10 | CLOSED MONTH |
2022-02-31 | 123 | 20 | CLOSED MONTH |
2022-03-31 | 999 | 30 | CLOSED MONTH |
2022-04-31 | 999 | 40 | CLOSED MONTH |
df2:
DATA | ID | VALUE | DSC |
---|---|---|---|
2022-01-31 | 123 | 50 | ESTIMATED DAY |
2022-02-31 | 123 | 60 | ESTIMATED DAY |
2022-03-31 | 123 | 70 | ESTIMATED DAY |
2022-04-31 | 123 | 80 | ESTIMATED DAY |
2022-03-20 | 123 | 90 | ESTIMATED DAY |
2022-03-25 | 123 | 100 | ESTIMATED DAY |
2022-04-30 | 999 | 120 | ESTIMATED DAY |
2022-05-02 | 999 | 150 | ESTIMATED DAY |
2022-05-03 | 999 | 200 | ESTIMATED DAY |
EXPECTED OUTPUT:
DATA | ID | VALUE | DSC |
---|---|---|---|
2022-01-31 | 123 | 10 | CLOSED MONTH |
2022-02-31 | 123 | 20 | CLOSED MONTH |
2022-03-31 | 999 | 30 | CLOSED MONTH |
2022-04-31 | 999 | 40 | CLOSED MONTH |
2022-03-20 | 123 | 90 | ESTIMATED DAY -Because closed month 3 has different ID |
2022-03-25 | 123 | 100 | ESTIMATED DAY -Because closed month 3 has different ID |
2022-05-02 | 999 | 150 | ESTIMATED DAY -Because there is no closed month 5 |
2022-05-03 | 999 | 200 | ESTIMATED DAY -Because there is no closed month 5 |
Does anyone know a solution?
I tried using window function Row_number, rank and dense_rank, but it didn't work.
CodePudding user response:
create another column which contains the closing date in both df1 and df2.
You can use the isIn function to filter out df2 lines based on df1[df1[closing_date, ID]]
Then, simply concat your 2 tables
CodePudding user response:
Try using a join instead of a union
import pyspark.sql.functions as F
result = df1.join(df2, ["DATA", "ID"], "outer").select(
"DATA",
"ID",
F.coalesce(df1.VALUE, df2.VALUE).alias("VALUE"),
F.coalesce(df1.DSC, df2.DSC).alias("DSC"),
)
Does that produce the result that you expect?