Sum of rows that match a condition in a dataframe

Time:07-23

I have this large dataframe in which I recurrently have duplicated events, one of which has a timestamp of zero:

a     b     c 
139 4E 08 0.234
163 6E 08 0.964
163     0 0.034
172 6E 08 1.173
183 6E 08 0.734
183     0 0.296

and so on. What I would like to do is add every row that satisfies the condition timestamp = 0 to the row before it, to get something like this:

a     b     c
139 4E 08 0.234
163 6E 08 0.998
172 6E 08 1.173
183 6E 08 1.030

I looked at various solutions but can't find the proper one. How could I solve this? Thanks.

CodePudding user response:

If I understood correctly what you want to do, you could group by the values in column a and sum the values on each of the other columns. Then reset the index to get column a back.

df.groupby('a').sum().reset_index()

Kudos to @ouyang-ze for the suggestion.
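
As a minimal sketch, here is that one-liner run on the sample data from the question. The column values are typed in by hand from the table above, and the name result is only illustrative. It relies on the assumption that a zero-timestamp row always shares its a value with the row it should be merged into.

```python
import pandas as pd

# Sample data from the question; 'b' is the timestamp column.
df = pd.DataFrame({
    'a': [139, 163, 163, 172, 183, 183],
    'b': [4e8, 6e8, 0.0, 6e8, 6e8, 0.0],
    'c': [0.234, 0.964, 0.034, 1.173, 0.734, 0.296],
})

# Rows sharing the same 'a' collapse into one row; 'b' and 'c' are summed.
# Adding a zero timestamp leaves the original 'b' value unchanged.
result = df.groupby('a').sum().reset_index()
print(result)
```

Note that b is summed as well, so this only preserves the timestamp when every group contains exactly one non-zero b.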

CodePudding user response:

The answer above assumes that each row with a 0 value in b has the same a value as the row before it. Since you have not stated this explicitly, here is an answer that generalizes without that assumption.

Use DataFrame.apply to add the c value of each zero-timestamp row to the previous row in the main dataframe.

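import pandas as pd

data_dict = {'a' : [139,163,163,172,183,183], 'b' : [float(4e08), float(6e08), 0, float(6e08), float(6e08), 0], 'c' : [0.234,0.964,0.034,1.173,0.734,0.296]}
df = pd.DataFrame(data_dict)

# rows whose timestamp (column b) is 0
mask = df['b'].eq(0)

def adder(row):
    # row.name is the index label (default 0..n-1 index assumed);
    # index 0 is skipped because it has no previous row
    if row.name:
        df.loc[row.name-1, 'c'] += row['c']

# add each zero-timestamp row's c to the row above it, then drop those rows
_ = df[mask].apply(adder, axis=1)
df = df[~mask]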

which gives us the expected output

df
     a            b      c
0  139  400000000.0  0.234
1  163  600000000.0  0.998
3  172  600000000.0  1.173
4  183  600000000.0  1.030
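
The same row-wise idea can also be sketched without apply by giving every non-zero-timestamp row a running group number and aggregating per group. This is a different technique from the one above, not a restatement of it. The names group_id and result are illustrative, and the sketch assumes the first row has a non-zero timestamp.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [139, 163, 163, 172, 183, 183],
    'b': [4e8, 6e8, 0.0, 6e8, 6e8, 0.0],
    'c': [0.234, 0.964, 0.034, 1.173, 0.734, 0.296],
})

# Each row with a non-zero timestamp starts a new group; any zero-timestamp
# rows that follow it get the same group number.
group_id = df['b'].ne(0).cumsum().rename('group')

# Keep 'a' and 'b' from the first row of each group and sum 'c'.
result = (
    df.groupby(group_id)
      .agg(a=('a', 'first'), b=('b', 'first'), c=('c', 'sum'))
      .reset_index(drop=True)
)
print(result)
```

Unlike the apply version, this also folds several consecutive zero-timestamp rows into the same preceding row, and the result is re-indexed from 0.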

Alternatively

Here is another solution using pandas.DataFrame.assign and pandas.DataFrame.loc.

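# df is the original six-row dataframe built from data_dict
# b_t and c_t hold the b and c values of the next row
df = df.assign(b_t = df['b'].shift(-1), c_t = df['c'].shift(-1) )
# rows that are followed by a zero-timestamp row
mask = df['b_t'].eq(0)
# write through a single .loc call so the sum lands in df itself
df.loc[mask, 'c'] += df.loc[mask, 'c_t']
df = df[~df['b'].eq(0)].drop(['b_t', 'c_t'], axis = 1)

which gives us the same output :

df
     a            b      c
0  139  400000000.0  0.234
1  163  600000000.0  0.998
3  172  600000000.0  1.173
4  183  600000000.0  1.030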
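
One detail worth illustrating when writing through a boolean mask: an assignment of the form df[mask].loc[:, 'c'] = ... operates on a temporary copy, whereas df.loc[mask, 'c'] = ... writes to the dataframe itself. The sketch below uses a small made-up frame and a placeholder value of 99.0 to make the difference visible. Warnings are silenced only because pandas may flag the chained form.

```python
import warnings
import pandas as pd

# Small illustrative frame (values chosen only for the demonstration).
df = pd.DataFrame({'b': [6e8, 0.0, 6e8], 'c': [0.964, 0.034, 1.173]})
mask = df['b'].eq(0)

# Chained form: df[mask] builds a new frame, so the write lands on that
# temporary object and df is left untouched.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df[mask].loc[:, 'c'] = 99.0
unchanged = df['c'].tolist()

# Single .loc call: the write goes to df directly.
df.loc[mask, 'c'] = 99.0
changed = df['c'].tolist()

print(unchanged)
print(changed)
```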

Don't want another mask

If you don't want to create another mask, you can reuse the same one.

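# df is the original six-row dataframe built from data_dict
df = df.assign(b_t = df['b'].shift(-1), c_t = df['c'].shift(-1) )
mask = df['b_t'].eq(0)
df.loc[mask, 'c'] += df.loc[mask, 'c_t']
# shift the mask down one row so it marks the zero-timestamp rows themselves
mask = mask.shift(1, fill_value=False)
# blank out those rows with NaN, drop the helper columns, then drop the blanked rows
df = df.mask(mask)
df = df.drop(['b_t', 'c_t'], axis = 1).dropna()

which gives us (column a becomes float because the masked rows are temporarily filled with NaN)

       a            b      c
0  139.0  400000000.0  0.234
1  163.0  600000000.0  0.998
3  172.0  600000000.0  1.173
4  183.0  600000000.0  1.030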