I have a dataframe df_calendar with 2 columns month and year. I have another dataframe df_stock with 4 columns month, year, plant, stock. Assuming that there are 12 records in df_calendar for the year 2022, I would like to check the stock in df_stock for each month present in df_calendar. Please note that df_stock has records only for the months when the stocks are available. If the stock does not exist for any month, I would like to report the plant name and stock as zero. Which join can I use? If not joins, what is the most efficient way? The data in df_stock is large.
calendar_columns = ["month_number", "year_number"]
calendar = [(1, 2022), \
(2, 2022), \
(3, 2022)]
df_calendar = spark.createDataFrame(data=calendar, schema = calendar_columns)
demand_columns = ["month_number", "year_number", "plant_id", "quantity"]
demand = [(1, 2022, "P1", 10), \
(2, 2022, "P1", 20), \
(3, 2022, "P1", 30), \
(1, 2022, "P2", 15)]
df_demand = spark.createDataFrame(data=demand, schema = demand_columns)
For the above example, the output should be
Plant | Month | Year | Stock |
---|---|---|---|
P1 | 1 | 2022 | 10 |
P1 | 2 | 2022 | 20 |
P1 | 3 | 2022 | 30 |
P2 | 1 | 2022 | 15 |
P2 | 2 | 2022 | 0 |
P2 | 3 | 2022 | 0 |
CodePudding user response:
It looks like you need to get a distinct list of plants, cross join it with months then join in the demand data. Finally replace nulls with zero.
df_rows = df_demand.select("plant_id").distinct().crossJoin(df_calendar)
df_final = df_rows.join(df_demand,['plant_id','month_number','year_number'], 'leftouter').na.fill(value=0)