Home > OS >  how to merge rows summing over each column without iteration
how to merge rows summing over each column without iteration

Time:05-10

I have the following df:

   prevent     _p     _n  _id
0        1      0      0  83135
0        0      1      0  83135
0        0      1      0  82238

I would like to merge all rows having the same column _idby summing over each column for the desired output in a dataframe, final (please note that if thee sum is greater than 1, the value should just be 1):

     prevent     _p     _n  _id
0          1      1      0  83135
0          0      1      0  82238

I can easily do this using the following code iterating over the dataframe:

final = pd.DataFrame()
for id_ in _ids:
        
        out = df[df._id == id_]
        
        prevent = 0
        _p = 0
        _n = 0

        d = {}

        if len(out) > 0:

            for row in out.itertuples():
                if prevent == 0:
                    prevent  = row.prevent
                if _p == 0:
                   _p  = row._p
                if _n == 0:
                   _n  = row._n
               
            d['_p'] = _p
            d['_n'] = _n
            d['prevent'] = prevent
           
            t=pd.DataFrame([d])
            t['_id'] = id_
            final=pd.concat([final, t])

I have several hundred thousand rows, so this will be very inefficient. Is there a way to vectorize this?

CodePudding user response:

Treat 0 and 1 as boolean with any, then convert them back to integers:

df.groupby("_id").any().astype("int").reset_index()

CodePudding user response:

Check groupby

out = df.groupby('_id',as_index=False).sum()

CodePudding user response:

groupby.sum to sum over id_, then clip to make sure a sum never exceeds 1:

out = df.groupby('_id').sum().clip(upper=1).reset_index()

Output:

     _id  prevent  _p  _n
0  82238        0   1   0
1  83135        1   1   0
  • Related