Home > Mobile >  Pandas sum rows between boolean values of another column
Pandas sum rows between boolean values of another column

Time:02-17

I am trying to sum the values of all the rows (for multiple columns in reality around 50), between the True values of a bool column in a pandas df, and fill the True rows with the sum number. I had a column with names and based on whether the names contained a substring, I made a mask boolean column. Let me give a visual example.

Current df:

mask col1 col2 col3
False 10 3 5
False 5 2 4
True
False 5 1 10
False 1 7 6
False 8 2 4
True

Desired/Target df:

mask col1 col2 col3
False 10 3 5
False 5 2 4
True 15 5 9
False 5 1 10
False 1 7 6
False 8 2 4
True 14 10 20

I made it with for loops iterating through the df (more "classic programming") but it takes forever cause the df usually is millions of lines. I am looking for a way to do this with pandas since it's very fast.

Also I tried this which I found in another post: (though I am not familiar with groupby and transform so I don't actually know what I did)

col_list = [x for x in df.columns if 'col' in x]
df[col_list] = df.groupby(df['mask'].cumsum()).transfrom('sum').where(df.mask)

But I got an error "TypeError: can only concatenate str (not "float") to str"

(the actual numbers in the dataframe are floats but I used integers in the example just to make things easy)

Any ideas? Thanks!

CodePudding user response:

Assuming empty cells are NaNs, you could use:

# start a new group after a True
group = df['mask'].shift(fill_value=False).cumsum()
# get the sum per group transforming all rows
# and fill the NaN of the original dataframe with it
df.combine_first(df.groupby(group).transform('sum'))

output:

    mask  col1  col2  col3
0  False  10.0   3.0   5.0
1  False   5.0   2.0   4.0
2   True  15.0   5.0   9.0
3  False   5.0   1.0  10.0
4  False   1.0   7.0   6.0
5  False   8.0   2.0   4.0
6   True  14.0  10.0  20.0
  • Related