The Task
I have a dataframe that looks like this:
| date | money_spent ($) | meals_eaten | weight |
|---|---|---|---|
| 2021-01-01 10:00:00 | 350 | 5 | 140 |
| 2021-01-02 18:00:00 | 250 | 2 | 170 |
| 2021-01-03 12:10:00 | 200 | 3 | 160 |
| 2021-01-04 19:40:00 | 100 | 1 | 150 |
I want to discretize this so that it "cuts" the rows every $X. I want to know some statistics on how much is being done for every $X I spend.

So if I were to use $500 as a threshold, the first two rows would fall in the first cut, and I could aggregate the remaining columns as follows:
- first `date` of the cut
- average `meals_eaten`
- minimum `weight`
- maximum `weight`
So the final table would be two rows like this:

| date | cumulative_spent ($) | meals_eaten | min_weight | max_weight |
|---|---|---|---|---|
| 2021-01-01 10:00:00 | 600 | 3.5 | 140 | 170 |
| 2021-01-03 12:10:00 | 300 | 2 | 150 | 160 |
My Approach
My first instinct is to calculate the `cumsum()` of the `money_spent` column (assume the data is sorted by date), then use `pd.cut()` to make a new column, call it `spent_bin`, that determines each row's bin.

Note: In this toy example, `spent_bin` would basically be `[0, 500]` for the first two rows and `(500, 1000]` for the last two.
Then it's fairly simple: I do a `groupby` on `spent_bin`, then aggregate as follows:

```python
.agg({
    'date': 'first',
    'meals_eaten': 'mean',
    'weight': ['min', 'max']
})
```
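(As a minimal sketch of that pipeline, assuming the toy frame built under "What I've Tried" below and explicit bin edges, it would look something like the following; note that the row crossing $500 still lands in the second bin, which is the caveat discussed at the end.)

```python
# Hypothetical end-to-end version of the approach described above.
df['cum_spent'] = df['money_spent'].cumsum()
df['spent_bin'] = pd.cut(df['cum_spent'], bins=[0, 500, 1000])
out = (df.groupby('spent_bin', observed=True)
         .agg({'date': 'first',
               'meals_eaten': 'mean',
               'weight': ['min', 'max']}))
```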
What I've Tried
```python
import pandas as pd

rows = [
    {"date": "2021-01-01 10:00:00", "money_spent": 350, "meals_eaten": 5, "weight": 140},
    {"date": "2021-01-02 18:00:00", "money_spent": 250, "meals_eaten": 2, "weight": 170},
    {"date": "2021-01-03 12:10:00", "money_spent": 200, "meals_eaten": 3, "weight": 160},
    {"date": "2021-01-04 19:40:00", "money_spent": 100, "meals_eaten": 1, "weight": 150},
]
df = pd.DataFrame.from_dict(rows)
df['date'] = pd.to_datetime(df.date)
df['cum_spent'] = df.money_spent.cumsum()

print(df)
print(pd.cut(df.cum_spent, 500))
```
For some reason, I can't get the `cut` step to work; the toy code above shows the problem. The labels are not cleanly `[0, 500], (500, 1000]`. Honestly, I'd settle for `[350, 500], (500, 800]` (these are the actual cumulative-sum values at the edges of the cuts), but I can't even get that to work, even though I'm following the documentation example. Any help with this?
Caveats and Difficulties
It's pretty easy to write this in a for loop, of course: just accumulate with something like `while cum_spent < 500:` and start a new bin each time the threshold is reached (a rough sketch is below). The problem is that I have millions of rows in my actual dataset, and it currently takes me 20 minutes to process a single df this way.
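Roughly, the loop version looks like this (a sketch only; `bin_ids` is just an illustrative helper name):

```python
# Row-by-row sketch: walk the rows, accumulate spend, and start a new
# bin right after the row that reaches the limit.
def bin_ids(spent, lim=500):
    ids, total, current = [], 0, 0
    for s in spent:
        ids.append(current)   # the crossing row stays in the current bin
        total += s
        if total >= lim:      # this row reached or passed the limit
            total = 0
            current += 1
    return ids
```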
There's also a minor issue: sometimes a row will break the interval, and when that happens I want that last row included. This shows up in the toy example, where row #2 actually ends at $600, not $500; but since it is the first row that ends at or surpasses $500, I'm including it in the first bin.
CodePudding user response:
You can use a customized function to achieve a cumulative sum with a reset limit (the numba-compiled `cumli`, defined at the end):
```python
# Flag the row that closes each bin, then turn the flags into group
# labels with a reversed cumsum.
df['new'] = cumli(df['money_spent'].values, 500)
out = df.groupby(df.new.iloc[::-1].cumsum()).agg(
    date=('date', 'first'),
    meals_eaten=('meals_eaten', 'mean'),
    min_weight=('weight', 'min'),
    max_weight=('weight', 'max')).sort_index(ascending=False)
```
```
Out[81]:
         date  meals_eaten  min_weight  max_weight
new
1  2021-01-01          3.5         140         170
0  2021-01-03          2.0         150         160
```
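The `iloc[::-1].cumsum()` step is what turns the flags into group labels: counting flags from the bottom up gives every row up to and including a bin-closing row the same label (here `[1, 1, 0, 0]` for the four rows), which is also why `sort_index(ascending=False)` is needed to put the groups back in chronological order.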
And `cumli` is the cumulative sum with a reset once the running total reaches the limit (note the `total += y`, so the sum accumulates across rows):

```python
from numba import njit

@njit
def cumli(x, lim):
    total = 0
    result = []
    for y in x:
        check = 0
        total += y          # running total of spend
        if total >= lim:    # this row closes the bin...
            total = 0       # ...so reset the running total
            check = 1       # ...and flag it
        result.append(check)
    return result
```
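Apart from the decorator the function is plain Python, so if numba isn't available you should be able to drop `@njit` and still get the same result, just much more slowly; on millions of rows, the compilation is what keeps this fast.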