Home > OS >  Azure Databricks: For each record in a dataframe find the same in another dataframe, insert missing
Azure Databricks: For each record in a dataframe find the same in another dataframe, insert missing

Time:01-14

I have a dataframe df_calendar with 2 columns month and year. I have another dataframe df_stock with 4 columns month, year, plant, stock. Assuming that there are 12 records in df_calendar for the year 2022, I would like to check the stock in df_stock for each month present in df_calendar. Please note that df_stock has records only for the months when the stocks are available. If the stock does not exist for any month, I would like to report the plant name and stock as zero. Which join can I use? If not joins, what is the most efficient way? The data in df_stock is large.

calendar_columns = ["month_number", "year_number"]

calendar = [(1, 2022), \
              (2, 2022), \
              (3, 2022)]

df_calendar = spark.createDataFrame(data=calendar, schema = calendar_columns)

demand_columns = ["month_number", "year_number", "plant_id", "quantity"]

demand = [(1, 2022, "P1", 10), \
            (2, 2022, "P1", 20), \
            (3, 2022, "P1", 30), \
            (1, 2022, "P2", 15)]

df_demand = spark.createDataFrame(data=demand, schema = demand_columns)

For the above example, the output should be

Plant Month Year Stock
P1 1 2022 10
P1 2 2022 20
P1 3 2022 30
P2 1 2022 15
P2 2 2022 0
P2 3 2022 0

CodePudding user response:

It looks like you need to get a distinct list of plants, cross join it with months then join in the demand data. Finally replace nulls with zero.

df_rows = df_demand.select("plant_id").distinct().crossJoin(df_calendar)
df_final = df_rows.join(df_demand,['plant_id','month_number','year_number'], 'leftouter').na.fill(value=0)
  • Related