More efficient ways to self join a dataframe?


I'm trying to find the total car count over the previous 5 sales entries for each dealership in a dataset.

My current approach using the code below is to:

  1. Calculate the row number for each entry
  2. Self join the dataframe to get the history for each dealership
  3. Keep the 5 previous entries
  4. Sum the sales for these 5
# Calculate row number for each sale
df = df.sort_values(["dealership_id", "time"], ascending=[True, True])
df["row_num"] = df.groupby(["dealership_id"]).cumcount()
df = df.drop_duplicates()  # drop_duplicates returns a copy, so assign it back
df_2 = df[["dealership_id", "sales_entry", "car_count", "row_num"]]

# Join to get history
df_3 = pd.merge(
    df_2[['dealership_id','sales_entry','row_num']],
    df_2[['dealership_id','car_count','row_num']],
    how="inner",
    on=["dealership_id"],
)

# Keep past 5 sales
df_4 = df_3.loc[(df_3['row_num_x'] - df_3['row_num_y'] > 0) & (df_3['row_num_x'] - df_3['row_num_y'] <= 5)]

# Sum 5 previous sales
df_4 = (
    df_4.groupby(["dealership_id", "sales_entry"])
    .agg({"car_count": "sum"})
    .reset_index()
)

However, due to the size of the datasets, the self-join in the second step is far too inefficient: it produces one row for every pair of entries within each dealership before filtering. Can anyone recommend a better way to solve this?

I also tried to create a column instead of using a join:

df_3['dealership_id'][df_3['dealership_id']].values

But I get the error "None of [Int64Index ...] are in the index". I have checked that my column names are clean and that the data types are all ints.
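For what it's worth, that error comes from label-based indexing, not from dirty column names: indexing a Series with another Series of integers looks those integers up as index labels. A minimal sketch with a hypothetical two-row frame:

```python
import pandas as pd

# Minimal reproduction of the "None of [Int64Index ...] are in the index" error.
df = pd.DataFrame({"dealership_id": [123, 456]})

# Indexing a Series with another Series of ints does *label* lookup:
# the values 123 and 456 are treated as index labels, but the default
# index is 0..n-1, so none of them exist and pandas raises KeyError.
try:
    df["dealership_id"][df["dealership_id"]]
except KeyError as exc:
    print(type(exc).__name__)  # KeyError
```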

   dealership_id sales_entry  car_count   time  row_num
0            123  entry_asfs          3  11:00        1
1            123  entry_kmsl          0  13:05        2
2            456  entry_sdfm          2  14:10        2
3            456  entry_sknw          1  10:10        1
4            456  entry_kmsl          1  14:35        3

CodePudding user response:

From the looks of it, you are trying to find the rolling sum of the car count over the past 5 entries for each dealership. That is a groupby followed by a rolling sum, with no join at all:

df = df.sort_values(["dealership_id", "time"], ascending=[True, True])
# .values aligns positionally because df is already sorted the same way
df['rolling_sum'] = df.groupby('dealership_id')['car_count'].rolling(5).sum().values
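If you want the sum of the 5 *previous* entries, excluding the current row as your join-based version does, a `shift(1)` inside each group before the rolling sum handles that. A sketch on the sample data from the question (column names assumed from your example; `prev_5_sum` is a name I made up):

```python
import pandas as pd

df = pd.DataFrame({
    "dealership_id": [123, 123, 456, 456, 456],
    "car_count": [3, 0, 2, 1, 1],
    "time": ["11:00", "13:05", "14:10", "10:10", "14:35"],
})
df = df.sort_values(["dealership_id", "time"])

# shift(1) excludes the current row; min_periods=1 gives a partial sum
# (instead of NaN) for dealerships with fewer than 5 prior entries.
df["prev_5_sum"] = (
    df.groupby("dealership_id")["car_count"]
      .transform(lambda s: s.shift(1).rolling(5, min_periods=1).sum())
)
```

The first entry of each dealership has no history, so it comes out as NaN; everything else is the sum of up to 5 prior `car_count` values.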