Pandas dataframe vectorized bucketing/aggregation?


The Task

I have a dataframe that looks like this:

date                 money_spent ($)  meals_eaten  weight
2021-01-01 10:00:00  350              5            140
2021-01-02 18:00:00  250              2            170
2021-01-03 12:10:00  200              3            160
2021-01-04 19:40:00  100              1            150

I want to discretize this so that it "cuts" the rows every $X spent, so I can compute statistics on what happens for every $X I spend.

So if I were to use $500 as a threshold, the first two rows would fall in the first cut, and I could aggregate the remaining columns as follows:

  • first date of the cut
  • average meals_eaten
  • minimum weight
  • maximum weight

So the final table would be two rows like this:

date                 cumulative_spent ($)  meals_eaten  min_weight  max_weight
2021-01-01 10:00:00  600                   3.5          140         170
2021-01-03 12:10:00  300                   2            150         160

My Approach

My first instinct is to calculate the cumsum() of money_spent (assuming the data is sorted by date), then use pd.cut() to make a new column, call it spent_bin, that assigns each row to a bin.

Note: in this toy example, spent_bin would basically be (0, 500] for the first two rows and (500, 1000] for the last two (strictly speaking, row #2's cumulative sum is 600; see the caveats below).

Then it's fairly simple: I do a groupby on spent_bin and aggregate as follows:

.agg({
    'date': 'first',
    'meals_eaten': 'mean',
    'weight': ['min', 'max']
})

What I've Tried

import pandas as pd

rows = [
    {"date": "2021-01-01 10:00:00", "money_spent": 350, "meals_eaten": 5, "weight": 140},
    {"date": "2021-01-02 18:00:00", "money_spent": 250, "meals_eaten": 2, "weight": 170},
    {"date": "2021-01-03 12:10:00", "money_spent": 200, "meals_eaten": 3, "weight": 160},
    {"date": "2021-01-05 22:07:00", "money_spent": 100, "meals_eaten": 1, "weight": 150},
]

df = pd.DataFrame.from_dict(rows)
df['date'] = pd.to_datetime(df.date)
df['cum_spent'] = df.money_spent.cumsum()

print(df)
print(pd.cut(df.cum_spent, 500))

For some reason, I can't get the cut step to work; the code above is my attempt. The labels are not cleanly (0, 500], (500, 1000]. Honestly I'd settle for (350, 500], (500, 800] (these are the actual cumulative sum values at the edges of the cuts), but I can't even get that to work, even though I'm doing the same as the documentation example. Any help with this?


Caveats and Difficulties

It's pretty easy to write this in a for loop, of course: just do a while cum_spent < 500: and slice off a bucket each time the threshold is reached, roughly like the sketch below. The problem is that I have millions of rows in my actual dataset, and it currently takes 20 minutes to process a single df this way.
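A minimal sketch of that slow row-by-row version, using the df built above (simplified; the real code also computes the per-bucket aggregates):

buckets = []
start, cum_spent = 0, 0
for i, spent in enumerate(df["money_spent"]):
    cum_spent += spent
    if cum_spent >= 500:                      # this row closes the bucket...
        buckets.append(df.iloc[start:i + 1])  # ...and is included in it
        start, cum_spent = i + 1, 0
if start < len(df):
    buckets.append(df.iloc[start:])           # leftover rows form a partial bucket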

There's also a minor issue: sometimes a row will break the interval. When that happens, I want that last row included. This shows up in the toy example, where row #2 actually ends at $600, not $500. But it is the first row that ends at or surpasses $500, so I include it in the first bin.

CodePudding user response:

You can use a customized function that computes the cumulative sum with a reset once the limit is reached, compiled with numba for speed:

from numba import njit

@njit
def cumli(x, lim):
    # Emit a 1 on the row where the running total reaches the limit,
    # then reset the total; emit 0 everywhere else.
    total = 0
    result = []
    for y in x:
        check = 0
        total += y
        if total >= lim:
            total = 0
            check = 1
        result.append(check)
    return result

df['new'] = cumli(df['money_spent'].values, 500)

# A reversed cumulative sum of the flags gives every row of a bucket,
# including the row that crosses the limit, the same group id.
out = df.groupby(df.new.iloc[::-1].cumsum()).agg(
    date = ('date','first'),
    meals_eaten = ('meals_eaten','mean'),
    min_weight = ('weight','min'),
    max_weight = ('weight','max')).sort_index(ascending=False)

Out[81]: 
            date  meals_eaten  min_weight  max_weight
new                                                  
1    2021-01-01           3.5         140         170
0    2021-01-03           2.0         150         160
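The trick is the reversed cumulative sum: cumli marks the row that closes each bucket with a 1, and summing those flags from the bottom up gives all rows of a bucket, including the closing row, the same id. A quick check with the flags produced from the toy data:

flags = pd.Series([0, 1, 0, 0])   # cumli([350, 250, 200, 100], 500)
print(flags.iloc[::-1].cumsum().sort_index())
# 0    1
# 1    1
# 2    0
# 3    0
# rows 0-1 share id 1, rows 2-3 share id 0; sort_index(ascending=False)
# in the groupby above then restores the original date order.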