I have this large dataframe in which I recurrently have duplicated events, one of which has a timestamp of zero:
a b c
139 4E+08 0.234
163 6E+08 0.964
163 0 0.034
172 6E+08 1.173
183 6E+08 0.734
183 0 0.296
and so on. What I would like to do is add every row that satisfies the condition timestamp = 0 to the previous row, to get something like this:
a b c
139 4E+08 0.234
163 6E+08 0.998
172 6E+08 1.173
183 6E+08 1.030
I looked at various solutions but can't find the right one. How could I solve this? Thanks.
CodePudding user response:
If I understood correctly what you want to do, you could group by the values in column a and sum the values in each of the other columns. Then reset the index to get column a back.
df.groupby('a').sum().reset_index()
Kudos to @ouyang-ze for the suggestion.
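As a self-contained sketch of the grouping idea on the sample data from the question: the variant below sums only c and takes the largest b per group, so the timestamp column is not summed. The choice of "max" for b and the name merged are assumptions made for illustration, and the approach still relies on each zero-timestamp row sharing its a value with the row it belongs to.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "a": [139, 163, 163, 172, 183, 183],
    "b": [4e8, 6e8, 0.0, 6e8, 6e8, 0.0],
    "c": [0.234, 0.964, 0.034, 1.173, 0.734, 0.296],
})

# One output row per value of a: keep the non-zero timestamp, sum c
merged = df.groupby("a", as_index=False).agg(
    b=("b", "max"),
    c=("c", "sum"),
)
print(merged)
```

If two unrelated events can share the same a value, this grouping would merge them as well, so it only fits data where a identifies an event.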
CodePudding user response:
The other answer assumes that each row with a 0 value in b has the same a value as the row before it. Since you have not stated this explicitly, here is an answer that generalizes without that assumption.
Use DataFrame.apply to add the c value of each row where b is 0 to the previous row in the main dataframe.
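import pandas as pd

data_dict = {'a' : [139,163,163,172,183,183], 'b' : [float(4e08), float(6e08), 0, float(6e08), float(6e08), 0], 'c' : [0.234,0.964,0.034,1.173,0.734,0.296]}
df = pd.DataFrame(data_dict)
mask = df['b'].eq(0)  # rows whose timestamp is zero

def adder(row):
    # row.name is the index label; label 0 has no previous row, so skip it
    if row.name:
        # add c to the previous row (assumes the default integer index)
        df.loc[row.name - 1, 'c'] += row['c']

_ = df[mask].apply(adder, axis=1)
df = df[~mask]  # drop the zero-timestamp rows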
which gives us the expected output
df
a b c
0 139 400000000.0 0.234
1 163 600000000.0 0.998
3 172 600000000.0 1.173
4 183 600000000.0 1.030
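The same idea of folding each zero-timestamp row into the row before it can also be written without apply. This is only a sketch, assuming the row order is meaningful; the names group_id and merged are introduced here for illustration. Every non-zero b starts a new group, so a zero row lands in the group of the row that precedes it, whatever its a value is.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "a": [139, 163, 163, 172, 183, 183],
    "b": [4e8, 6e8, 0.0, 6e8, 6e8, 0.0],
    "c": [0.234, 0.964, 0.034, 1.173, 0.734, 0.296],
})

# Running count of non-zero timestamps: 1, 2, 2, 3, 4, 4
group_id = df["b"].ne(0).cumsum().rename("group")

# Keep a and b from the first row of each group, sum c over the group
merged = (
    df.groupby(group_id)
      .agg(a=("a", "first"), b=("b", "first"), c=("c", "sum"))
      .reset_index(drop=True)
)
print(merged)
```

Several consecutive zero rows are all added to the same preceding row. A zero row at the very top of the frame has nothing before it and stays in a group of its own.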
Alternatively
Here is another solution using pandas.DataFrame.assign and pandas.DataFrame.loc.
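# start again from the original df built from data_dict
df = df.assign(b_t = df['b'].shift(-1), c_t = df['c'].shift(-1) )
mask = df['b_t'].eq(0)  # rows that are followed by a zero-timestamp row
df.loc[mask, 'c'] += df.loc[mask, 'c_t']  # add the next row's c in place
df = df[~df['b'].eq(0)].drop(columns=['b_t', 'c_t'])
which gives us the same output:
df
a b c
0 139 400000000.0 0.234
1 163 600000000.0 0.998
3 172 600000000.0 1.173
4 183 600000000.0 1.030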
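One detail worth checking with mask-based updates like this is where the assignment actually lands. Boolean indexing such as df[mask] returns a new object, so writing through df[mask].loc[...] changes that temporary copy and leaves df itself untouched, whereas a single df.loc[mask, ...] call writes into df. A minimal sketch on a two-row frame (the value 99.0 and the variable names are made up for illustration):

```python
import warnings

import pandas as pd

df = pd.DataFrame({"b": [6e8, 0.0], "c": [0.964, 0.034]})
mask = df["b"].ne(0)

# Chained form: the assignment goes to a temporary copy of the selected rows.
# pandas may warn about this, so warnings are silenced for the demonstration.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[mask].loc[:, "c"] = 99.0
chained_result = df["c"].tolist()  # df is unchanged

# Single .loc call: the assignment modifies df itself
df.loc[mask, "c"] = 99.0
direct_result = df["c"].tolist()

print(chained_result, direct_result)
```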
Don't want another mask
If you don't want to build a second mask from df['b'], you can reuse the first mask by shifting it down one row.
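# start again from the original df built from data_dict
df = df.assign(b_t = df['b'].shift(-1), c_t = df['c'].shift(-1) )
mask = df['b_t'].eq(0)  # rows that are followed by a zero-timestamp row
df.loc[mask, 'c'] += df.loc[mask, 'c_t']  # add the next row's c in place
mask = mask.shift(1, fill_value=False)  # now marks the zero-timestamp rows themselves
df = df.mask(mask)  # replace those rows with NaN
# drop the helper columns first so their trailing NaN cannot remove a valid last row
df = df.drop(['b_t', 'c_t'], axis = 1).dropna()
which gives us
a b c
0 139.0 400000000.0 0.234
1 163.0 600000000.0 0.998
3 172.0 600000000.0 1.173
4 183.0 600000000.0 1.030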