Pandas dataframe vectorized bucketing/aggregation?


The Task

I have a dataframe that looks like this:

date                 money_spent ($)  meals_eaten  weight
2021-01-01 10:00:00  350              5            140
2021-01-02 18:00:00  250              2            170
2021-01-03 12:10:00  200              3            160
2021-01-04 19:40:00  100              1            150

I want to discretize this so that it "cuts" the rows every $X spent, so I can compute statistics on what happens for every $X I spend.

So if I were to use $500 as a threshold, the first two rows would fall in the first cut, and I could aggregate the remaining columns as follows:

  • first date of the cut
  • average meals_eaten
  • minimum weight
  • maximum weight

So the final table would be two rows like this:

date                 cumulative_spent ($)  meals_eaten  min_weight  max_weight
2021-01-01 10:00:00  600                   3.5          140         170
2021-01-03 12:10:00  300                   2            150         160

My Approach

My first instinct is to calculate the cumsum() of money_spent (assuming the data is sorted by date), then use pd.cut() to make a new column, call it spent_bin, that assigns each row to a bin.

Note: in this toy example, spent_bin would basically be (0, 500] for the first two rows and (500, 1000] for the last two (strictly speaking, row #2's cumulative sum is 600; see the caveats below).

Then it's fairly simple: I do a groupby on spent_bin and aggregate as follows:

.agg({
    'date': 'first',
    'meals_eaten': 'mean',
    'weight': ['min', 'max']
})

What I've Tried

import pandas as pd

rows = [
    {"date": "2021-01-01 10:00:00", "money_spent": 350, "meals_eaten": 5, "weight": 140},
    {"date": "2021-01-02 18:00:00", "money_spent": 250, "meals_eaten": 2, "weight": 170},
    {"date": "2021-01-03 12:10:00", "money_spent": 200, "meals_eaten": 3, "weight": 160},
    {"date": "2021-01-05 22:07:00", "money_spent": 100, "meals_eaten": 1, "weight": 150},
]

df = pd.DataFrame.from_dict(rows)
df['date'] = pd.to_datetime(df.date)
df['cum_spent'] = df.money_spent.cumsum()

print(df)
print(pd.cut(df.cum_spent, 500))

For some reason, I can't get the cut step to work; the code above is my attempt. The labels are not cleanly (0, 500], (500, 1000]. Honestly I'd settle for (350, 500], (500, 800] (these are the actual cumulative sum values at the edges of the cuts), but I can't even get that to work, even though I'm doing the same as the documentation example. Any help with this?


Caveats and Difficulties

It's pretty easy to write this in a for loop, of course: just do a while cum_spent < 500: and slice off a bucket each time the threshold is reached, roughly like the sketch below. The problem is that I have millions of rows in my actual dataset, and it currently takes 20 minutes to process a single df this way.
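A minimal sketch of that slow row-by-row version, using the df built above (simplified; the real code also computes the per-bucket aggregates):

buckets = []
start, cum_spent = 0, 0
for i, spent in enumerate(df["money_spent"]):
    cum_spent += spent
    if cum_spent >= 500:                      # this row closes the bucket...
        buckets.append(df.iloc[start:i + 1])  # ...and is included in it
        start, cum_spent = i + 1, 0
if start < len(df):
    buckets.append(df.iloc[start:])           # leftover rows form a partial bucket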

There's also a minor issue: sometimes a row will break the interval. When that happens, I want that last row included. This shows up in the toy example, where row #2 actually ends at $600, not $500. But it is the first row that ends at or surpasses $500, so I include it in the first bin.

CodePudding user response:

You can use a customized function that computes the cumulative sum with a reset once the limit is reached, compiled with numba for speed:

from numba import njit

@njit
def cumli(x, lim):
    # Emit a 1 on the row where the running total reaches the limit,
    # then reset the total; emit 0 everywhere else.
    total = 0
    result = []
    for y in x:
        check = 0
        total += y
        if total >= lim:
            total = 0
            check = 1
        result.append(check)
    return result

df['new'] = cumli(df['money_spent'].values, 500)

# A reversed cumulative sum of the flags gives every row of a bucket,
# including the row that crosses the limit, the same group id.
out = df.groupby(df.new.iloc[::-1].cumsum()).agg(
    date = ('date','first'),
    meals_eaten = ('meals_eaten','mean'),
    min_weight = ('weight','min'),
    max_weight = ('weight','max')).sort_index(ascending=False)

Out[81]: 
            date  meals_eaten  min_weight  max_weight
new                                                  
1    2021-01-01           3.5         140         170
0    2021-01-03           2.0         150         160
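The trick is the reversed cumulative sum: cumli marks the row that closes each bucket with a 1, and summing those flags from the bottom up gives all rows of a bucket, including the closing row, the same id. A quick check with the flags produced from the toy data:

flags = pd.Series([0, 1, 0, 0])   # cumli([350, 250, 200, 100], 500)
print(flags.iloc[::-1].cumsum().sort_index())
# 0    1
# 1    1
# 2    0
# 3    0
# rows 0-1 share id 1, rows 2-3 share id 0; sort_index(ascending=False)
# in the groupby above then restores the original date order.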