Pandas sum column until >= sum(values) then split left df items to dfNEW-CodePudding

I have df with mixed datatypes, example:

df:

name	value
1st	1
2nd	5
3rd	3.5
4th	8

when df['value'].sum() >= 10 split to dfNew (sum of first 3 = 9.5 so need to split left items to dfNew, in my example need split last row)

dfNew:

name	value
4st	8

P.s. I assume I can done it by iterating(itertuples/iteritems) & sum(items) then get indexes and split but what are 'more pandas' way done this?

CodePudding user response：

use cumsum and then do the binning and then do the groupby in binning for example:

df['cumsum_val'] = df['value'].cumsum()

binning = list(range(0, math.ceil(df['cumsum_val'].iloc[-1]) 10), 10) # here you are checking with every 10 (addition of the latest)

df['binning'] = pd.cut(df['cumsum_val'], bins= binning)

after this, you can do the groupby in binnings:

grouped_df = df.groupby('binning')

finally get the splitted df_lists

df_new_lists = [grouped_df.get_group(x) for x in grouped_df.groups]

df_new_lists is the lists of the dataframe

CodePudding user response：

You can use a combination of:

.cumsum to get the cumulative sums;
cut to compare the cumulative sums with your threshold;
.groupby to group rows that received the same result.

import pandas as pd

df = pd.DataFrame({'name': ['1st', '2nd', '3rd', '4th', '5th', '6th', '7th'], 'value': [1,5,3.5,8,2,9,2]})
threshold = 10

df['cumsums'] = df['value'].cumsum()

bins = range(0, int(df['cumsums'].iloc[-1] threshold 1), threshold)
df['groups'] = pd.cut(df['cumsums'], bins=bins)

df_list = [pd.DataFrame(g) for _,g in df.groupby('groups')]

print(df)
#   name  value  cumsums    groups
# 0  1st    1.0      1.0   (0, 10]
# 1  2nd    5.0      6.0   (0, 10]
# 2  3rd    3.5      9.5   (0, 10]
# 3  4th    8.0     17.5  (10, 20]
# 4  5th    2.0     19.5  (10, 20]
# 5  6th    9.0     28.5  (20, 30]
# 6  7th    2.0     30.5  (30, 40]

print(df_list)
# [
#   name  value  cumsums   groups
# 0  1st    1.0      1.0  (0, 10]
# 1  2nd    5.0      6.0  (0, 10]
# 2  3rd    3.5      9.5  (0, 10],
#   name  value  cumsums    groups
# 3  4th    8.0     17.5  (10, 20]
# 4  5th    2.0     19.5  (10, 20],
#   name  value  cumsums    groups
# 5  6th    9.0     28.5  (20, 30]
#   name  value  cumsums    groups
# 6  7th    2.0     30.5  (30, 40]
# ]

Note that this code assumes that values are nonnegative, so that the cumsums are positive and increasing. If that is not the case, then you must use a more robust definition for bins, such as:

bins = range(int(min(df['cumsums'])), int(max(df['cumsums'])) threshold 1, threshold)

CodePudding user response：

Use cumsum and idxmax then slice the dataframe ([idx:])

dfNew = df.iloc[df['value'].cumsum().ge(10).idxmax():]

Output

>>> dfNew
  name  value
3  4th    8.0