Azure Databricks: For each record in a dataframe find the same in another dataframe, insert missing-CodePudding

I have a dataframe df_calendar with 2 columns month and year. I have another dataframe df_stock with 4 columns month, year, plant, stock. Assuming that there are 12 records in df_calendar for the year 2022, I would like to check the stock in df_stock for each month present in df_calendar. Please note that df_stock has records only for the months when the stocks are available. If the stock does not exist for any month, I would like to report the plant name and stock as zero. Which join can I use? If not joins, what is the most efficient way? The data in df_stock is large.

calendar_columns = ["month_number", "year_number"]

calendar = [(1, 2022), \
              (2, 2022), \
              (3, 2022)]

df_calendar = spark.createDataFrame(data=calendar, schema = calendar_columns)

demand_columns = ["month_number", "year_number", "plant_id", "quantity"]

demand = [(1, 2022, "P1", 10), \
            (2, 2022, "P1", 20), \
            (3, 2022, "P1", 30), \
            (1, 2022, "P2", 15)]

df_demand = spark.createDataFrame(data=demand, schema = demand_columns)

For the above example, the output should be

Plant	Month	Year	Stock
P1	1	2022	10
P1	2	2022	20
P1	3	2022	30
P2	1	2022	15
P2	2	2022	0
P2	3	2022	0

CodePudding user response：

It looks like you need to get a distinct list of plants, cross join it with months then join in the demand data. Finally replace nulls with zero.

df_rows = df_demand.select("plant_id").distinct().crossJoin(df_calendar)
df_final = df_rows.join(df_demand,['plant_id','month_number','year_number'], 'leftouter').na.fill(value=0)