I have a df that looks like this with many rows:
col1 count
a 80
b 100
c 20
I need to chunk this dataframe by sum of the count column not exceeding 100 in sum. So the chunk code create chunks of the df in the list should be like so, where each chunk is determined by the sum of the count column not exceeding 100. Also in my case the index does not matter as long as the column values are the same:
lst_df = [chunk1, chunk2]
Chunk1 =
col1 count
a 80
c 20
Chunk2 =
col1 count
b 100
I can chunk by row count but not sure how to do it by sum of values in a column and repeat that.
n = 25 #chunk row size
list_df = [df[i:i n] for i in range(0,df.shape[0],n)]
CodePudding user response:
Question a little bit vague. What happens if the sum of consecutive rows exceeds hundred? say I have 34, 70, 100? How do I treat that? Also how do you determine elements in list_df ?... manually?
Assuming it will always sum to hundred, approach it this way
Since index does not matter,
- sort df based on count.
- Cumulatively sum row by row.
- Check when multiples of 100 area reached using modulus and convert to integer
- Cumsum 3 above to form groups
- Group df by groups in 4 and subset using list comprehension
- create a dict of subset with your list_df names
- Display df by dict_get method
Code below
lst_df = ['chunk1', 'chunk2']
dfs = dict(zip(lst_df,[x for _, x in df.groupby((df.sort_values('count', ascending=False)['count'].expanding(1).sum()0==0).astype(int).cumsum())]))
dfs.get('chunk1')