Home > Back-end >  Pandas sum column until >= sum(values) then split left df items to dfNEW
Pandas sum column until >= sum(values) then split left df items to dfNEW

Time:09-17

I have df with mixed datatypes, example:

df:

name value
1st 1
2nd 5
3rd 3.5
4th 8

when df['value'].sum() >= 10 split to dfNew (sum of first 3 = 9.5 so need to split left items to dfNew, in my example need split last row)

dfNew:

name value
4st 8

P.s. I assume I can done it by iterating(itertuples/iteritems) & sum(items) then get indexes and split but what are 'more pandas' way done this?

CodePudding user response:

use cumsum and then do the binning and then do the groupby in binning for example:

df['cumsum_val'] = df['value'].cumsum()

binning = list(range(0, math.ceil(df['cumsum_val'].iloc[-1]) 10), 10) # here you are checking with every 10 (addition of the latest)

df['binning'] = pd.cut(df['cumsum_val'], bins= binning)

after this, you can do the groupby in binnings:

grouped_df = df.groupby('binning')

finally get the splitted df_lists

df_new_lists = [grouped_df.get_group(x) for x in grouped_df.groups]

df_new_lists is the lists of the dataframe

CodePudding user response:

You can use a combination of:

  • .cumsum to get the cumulative sums;
  • cut to compare the cumulative sums with your threshold;
  • .groupby to group rows that received the same result.
import pandas as pd

df = pd.DataFrame({'name': ['1st', '2nd', '3rd', '4th', '5th', '6th', '7th'], 'value': [1,5,3.5,8,2,9,2]})
threshold = 10

df['cumsums'] = df['value'].cumsum()

bins = range(0, int(df['cumsums'].iloc[-1] threshold 1), threshold)
df['groups'] = pd.cut(df['cumsums'], bins=bins)

df_list = [pd.DataFrame(g) for _,g in df.groupby('groups')]

print(df)
#   name  value  cumsums    groups
# 0  1st    1.0      1.0   (0, 10]
# 1  2nd    5.0      6.0   (0, 10]
# 2  3rd    3.5      9.5   (0, 10]
# 3  4th    8.0     17.5  (10, 20]
# 4  5th    2.0     19.5  (10, 20]
# 5  6th    9.0     28.5  (20, 30]
# 6  7th    2.0     30.5  (30, 40]

print(df_list)
# [
#   name  value  cumsums   groups
# 0  1st    1.0      1.0  (0, 10]
# 1  2nd    5.0      6.0  (0, 10]
# 2  3rd    3.5      9.5  (0, 10],
#   name  value  cumsums    groups
# 3  4th    8.0     17.5  (10, 20]
# 4  5th    2.0     19.5  (10, 20],
#   name  value  cumsums    groups
# 5  6th    9.0     28.5  (20, 30]
#   name  value  cumsums    groups
# 6  7th    2.0     30.5  (30, 40]
# ]

Note that this code assumes that values are nonnegative, so that the cumsums are positive and increasing. If that is not the case, then you must use a more robust definition for bins, such as:

bins = range(int(min(df['cumsums'])), int(max(df['cumsums'])) threshold 1, threshold)

CodePudding user response:

Use cumsum and idxmax then slice the dataframe ([idx:])

dfNew = df.iloc[df['value'].cumsum().ge(10).idxmax():]

Output

>>> dfNew
  name  value
3  4th    8.0
  • Related