Home > Net >  How to chunk dataframe into list by sum of value in column not exceeding X?
How to chunk dataframe into list by sum of value in column not exceeding X?

Time:12-30

I have a df that looks like this with many rows:

col1  count
 a     80
 b     100
 c     20

I need to chunk this dataframe by sum of the count column not exceeding 100 in sum. So the chunk code create chunks of the df in the list should be like so, where each chunk is determined by the sum of the count column not exceeding 100. Also in my case the index does not matter as long as the column values are the same:

lst_df = [chunk1, chunk2]

Chunk1 = 
    col1  count
     a     80
     c     20

Chunk2 = 
    col1  count
     b     100

I can chunk by row count but not sure how to do it by sum of values in a column and repeat that.

n = 25  #chunk row size
list_df = [df[i:i n] for i in range(0,df.shape[0],n)]

CodePudding user response:

Question a little bit vague. What happens if the sum of consecutive rows exceeds hundred? say I have 34, 70, 100? How do I treat that? Also how do you determine elements in list_df ?... manually?

Assuming it will always sum to hundred, approach it this way

Since index does not matter,

  1. sort df based on count.
  2. Cumulatively sum row by row.
  3. Check when multiples of 100 area reached using modulus and convert to integer
  4. Cumsum 3 above to form groups
  5. Group df by groups in 4 and subset using list comprehension
  6. create a dict of subset with your list_df names
  7. Display df by dict_get method

Code below

lst_df = ['chunk1', 'chunk2']

dfs = dict(zip(lst_df,[x for _, x in df.groupby((df.sort_values('count', ascending=False)['count'].expanding(1).sum()0==0).astype(int).cumsum())]))

dfs.get('chunk1')
  • Related