I am trying to sum the values of all the rows (for multiple columns in reality around 50), between the True values of a bool column in a pandas df, and fill the True rows with the sum number. I had a column with names and based on whether the names contained a substring, I made a mask boolean column. Let me give a visual example.
Current df:
mask | col1 | col2 | col3 |
---|---|---|---|
False | 10 | 3 | 5 |
False | 5 | 2 | 4 |
True | |||
False | 5 | 1 | 10 |
False | 1 | 7 | 6 |
False | 8 | 2 | 4 |
True |
Desired/Target df:
mask | col1 | col2 | col3 |
---|---|---|---|
False | 10 | 3 | 5 |
False | 5 | 2 | 4 |
True | 15 | 5 | 9 |
False | 5 | 1 | 10 |
False | 1 | 7 | 6 |
False | 8 | 2 | 4 |
True | 14 | 10 | 20 |
I made it with for loops iterating through the df (more "classic programming") but it takes forever cause the df usually is millions of lines. I am looking for a way to do this with pandas since it's very fast.
Also I tried this which I found in another post: (though I am not familiar with groupby and transform so I don't actually know what I did)
col_list = [x for x in df.columns if 'col' in x]
df[col_list] = df.groupby(df['mask'].cumsum()).transfrom('sum').where(df.mask)
But I got an error "TypeError: can only concatenate str (not "float") to str"
(the actual numbers in the dataframe are floats but I used integers in the example just to make things easy)
Any ideas? Thanks!
CodePudding user response:
Assuming empty cells are NaNs, you could use:
# start a new group after a True
group = df['mask'].shift(fill_value=False).cumsum()
# get the sum per group transforming all rows
# and fill the NaN of the original dataframe with it
df.combine_first(df.groupby(group).transform('sum'))
output:
mask col1 col2 col3
0 False 10.0 3.0 5.0
1 False 5.0 2.0 4.0
2 True 15.0 5.0 9.0
3 False 5.0 1.0 10.0
4 False 1.0 7.0 6.0
5 False 8.0 2.0 4.0
6 True 14.0 10.0 20.0