I have the following df
:
prevent _p _n _id
0 1 0 0 83135
0 0 1 0 83135
0 0 1 0 82238
I would like to merge all rows having the same column _id
by summing over each column for
the desired output in a dataframe, final
(please note that if thee sum is greater than 1, the value should just be 1):
prevent _p _n _id
0 1 1 0 83135
0 0 1 0 82238
I can easily do this using the following code iterating over the dataframe:
final = pd.DataFrame()
for id_ in _ids:
out = df[df._id == id_]
prevent = 0
_p = 0
_n = 0
d = {}
if len(out) > 0:
for row in out.itertuples():
if prevent == 0:
prevent = row.prevent
if _p == 0:
_p = row._p
if _n == 0:
_n = row._n
d['_p'] = _p
d['_n'] = _n
d['prevent'] = prevent
t=pd.DataFrame([d])
t['_id'] = id_
final=pd.concat([final, t])
I have several hundred thousand rows, so this will be very inefficient. Is there a way to vectorize this?
CodePudding user response:
Treat 0 and 1 as boolean with any
, then convert them back to integers:
df.groupby("_id").any().astype("int").reset_index()
CodePudding user response:
Check groupby
out = df.groupby('_id',as_index=False).sum()
CodePudding user response:
groupby.sum
to sum over id_
, then clip
to make sure a sum never exceeds 1:
out = df.groupby('_id').sum().clip(upper=1).reset_index()
Output:
_id prevent _p _n
0 82238 0 1 0
1 83135 1 1 0