I have df with mixed datatypes, example:
df:
name | value |
---|---|
1st | 1 |
2nd | 5 |
3rd | 3.5 |
4th | 8 |
when df['value'].sum() >= 10 split to dfNew (sum of first 3 = 9.5 so need to split left items to dfNew, in my example need split last row)
dfNew:
name | value |
---|---|
4st | 8 |
P.s. I assume I can done it by iterating(itertuples/iteritems) & sum(items) then get indexes and split but what are 'more pandas' way done this?
CodePudding user response:
use cumsum
and then do the binning
and then do the groupby
in binning for example:
df['cumsum_val'] = df['value'].cumsum()
binning = list(range(0, math.ceil(df['cumsum_val'].iloc[-1]) 10), 10) # here you are checking with every 10 (addition of the latest
)
df['binning'] = pd.cut(df['cumsum_val'], bins= binning)
after this, you can do the groupby in binnings:
grouped_df = df.groupby('binning')
finally get the splitted df_lists
df_new_lists = [grouped_df.get_group(x) for x in grouped_df.groups]
df_new_lists
is the lists of the dataframe
CodePudding user response:
You can use a combination of:
.cumsum
to get the cumulative sums;cut
to compare the cumulative sums with your threshold;.groupby
to group rows that received the same result.
import pandas as pd
df = pd.DataFrame({'name': ['1st', '2nd', '3rd', '4th', '5th', '6th', '7th'], 'value': [1,5,3.5,8,2,9,2]})
threshold = 10
df['cumsums'] = df['value'].cumsum()
bins = range(0, int(df['cumsums'].iloc[-1] threshold 1), threshold)
df['groups'] = pd.cut(df['cumsums'], bins=bins)
df_list = [pd.DataFrame(g) for _,g in df.groupby('groups')]
print(df)
# name value cumsums groups
# 0 1st 1.0 1.0 (0, 10]
# 1 2nd 5.0 6.0 (0, 10]
# 2 3rd 3.5 9.5 (0, 10]
# 3 4th 8.0 17.5 (10, 20]
# 4 5th 2.0 19.5 (10, 20]
# 5 6th 9.0 28.5 (20, 30]
# 6 7th 2.0 30.5 (30, 40]
print(df_list)
# [
# name value cumsums groups
# 0 1st 1.0 1.0 (0, 10]
# 1 2nd 5.0 6.0 (0, 10]
# 2 3rd 3.5 9.5 (0, 10],
# name value cumsums groups
# 3 4th 8.0 17.5 (10, 20]
# 4 5th 2.0 19.5 (10, 20],
# name value cumsums groups
# 5 6th 9.0 28.5 (20, 30]
# name value cumsums groups
# 6 7th 2.0 30.5 (30, 40]
# ]
Note that this code assumes that values are nonnegative, so that the cumsums are positive and increasing. If that is not the case, then you must use a more robust definition for bins
, such as:
bins = range(int(min(df['cumsums'])), int(max(df['cumsums'])) threshold 1, threshold)
CodePudding user response:
Use cumsum
and idxmax
then slice the dataframe ([idx:])
dfNew = df.iloc[df['value'].cumsum().ge(10).idxmax():]
Output
>>> dfNew
name value
3 4th 8.0