I'm trying to find the total car count over the past 5 sales entries for each dealership in a dataset.
My current approach, using the code below, is to:
- Calculate the row number for each entry
- Self join the dataframe to get the history for each dealership
- Keep the 5 previous entries
- Sum the car counts for those 5 entries
import pandas as pd

# Calculate row number for each sale within each dealership
df = df.sort_values(["dealership_id", "time"], ascending=[True, True])
df["row_num"] = df.groupby("dealership_id").cumcount() + 1  # start at 1 to match the output below
df = df.drop_duplicates()  # assign back, otherwise the duplicates are kept
df_2 = df[["dealership_id", "sales_entry", "car_count", "row_num"]]
# Join to get history
df_3 = pd.merge(
    df_2[["dealership_id", "sales_entry", "row_num"]],
    df_2[["dealership_id", "car_count", "row_num"]],
    how="inner",
    on=["dealership_id"],
)
# Keep only the 5 entries immediately before each sales entry
df_4 = df_3.loc[(df_3['row_num_x'] - df_3['row_num_y'] >= 1) & (df_3['row_num_x'] - df_3['row_num_y'] <= 5)]
# Sum 5 previous sales
df_4 = (
    df_4.groupby(["dealership_id", "sales_entry"])
    .agg({"car_count": "sum"})
    .reset_index()
)
However, due to the size of the dataset, the self-join in the second step is far too inefficient. Can anyone recommend a better way to solve this?
I also tried to create a column instead of using a join:
df_3['dealership_id'][df_3['dealership_id']].values
But I get the error "None of [Int64Index .... are in the index". I have checked that my column names are clean and that the data types are all ints.
   dealership_id sales_entry  car_count   time  row_num
0            123  entry_asfs          3  11:00        1
1            123  entry_kmsl          0  13:05        2
2            456  entry_sdfm          2  14:10        3
3            456  entry_sknw          1  10:10        1
4            456  entry_kmsl          1  14:35        2
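For reference, here is the sample data above as a reproducible snippet (the time column is kept as plain strings here purely for illustration):
import pandas as pd

df = pd.DataFrame(
    {
        "dealership_id": [123, 123, 456, 456, 456],
        "sales_entry": ["entry_asfs", "entry_kmsl", "entry_sdfm", "entry_sknw", "entry_kmsl"],
        "car_count": [3, 0, 2, 1, 1],
        "time": ["11:00", "13:05", "14:10", "10:10", "14:35"],
        "row_num": [1, 2, 3, 1, 2],
    }
)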
CodePudding user response:
From what it looks like, you are trying to find the rolling sum of the car count over the past 5 sales entries for each dealership. That is a groupby followed by a rolling sum:
df = df.sort_values(["dealership_id", "time"], ascending=[True, True])
# .values works here because df is already sorted by dealership_id, so the
# grouped result lines up with the original row order
df['rolling_sum'] = df.groupby('dealership_id')['car_count'].rolling(5).sum().values
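If you want the sum of the 5 entries before each sale (excluding the current row), and a value even for dealerships with fewer than 5 prior entries, a small variation should work. This is just a sketch assuming the same column names as above; transform keeps the original index, so the .values alignment trick is not needed:
import pandas as pd

df = df.sort_values(["dealership_id", "time"], ascending=[True, True])

# shift(1) drops the current entry from the window; min_periods=1 avoids NaN
# when a dealership has fewer than 5 previous entries
df["prev_5_sum"] = (
    df.groupby("dealership_id")["car_count"]
    .transform(lambda s: s.shift(1).rolling(5, min_periods=1).sum())
)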