Reduce rows that has a repeated values in a column and summarises rest of column in different ways m-CodePudding

I have this data frame

import pandas as pd

data = [
    ['ACOT', '00001', '', '', 1.5, 20, 30, 'col1ACOT'],
    ['ACOT', '00002', '', '', 1.7, 20, 33,'col1ACOT'],
    ['ACOT', '00003', '','NA_0001' ,1.4, 20, 40,'col1ACOT'],
    ['PAN', '000090', 'canonical', '', 0.5, 10, 30,'col1PAN'],
    ['PAN', '000091', '', '', 0.4, 10, 30,'col1PAN'],
    ['TOM', '000080', 'canonical', '', 0.4, 10, 15,'col1TOM'],
    ['TOM', '000040', '', '', 1.7, 10, 300,'col1TOM']
]

df = pd.DataFrame(data, columns=[
    'Gene_name', 'Transcript_ID', 'canonical', 'mane', 'metrics','start','end', 'Example_extra_col'])



Gene_name   Transcript_ID   canonical   mane    metrics start   end Example_extra_col  
0   ACOT    00001                               1.5 20  30   col1ACOT
1   ACOT    00002                       NA_0001 1.7 20  33   col1ACOT
2   ACOT    00003                               1.4 20  40   col1ACOT
3   PAN     000090          canonical   NA_00090    0.5 10  30   col1PAN
4   PAN     000091                              0.4 10  30   col1PAN
5   TOM     000080          canonical           0.4 10  15   col1TOM
6   TOM     000040                              1.7 10  300   col1TOM

I want this output

Gene_name       canonical   mane    metrics     Example_extra_col
0   ACOT             No       Yes    1.4-1.5    col1ACOT
4   PAN              Yes      Yes    0.5-0.4    col1PAN
5   TOM              Yes       No    1.7-0.4    col1TOM

Partially, I can do this with these lines

f = lambda x: "Yes" if x.any() else "No" # For canonical and mane
df = df.groupby('Gene_name').agg({'canonical': f, 'mane': f, 'metrics': ['min', 'max']})

 canonical   mane    metrics_min  metrics_max
Gene_name               
ACOT        No          Yes     1.4          1.7
PAN         Yes         Yes     0.4          0.5
TOM         Yes         No      0.4          1.7

But I lost the Example_extra_col (and the following ones) because my real data frame has several more columns.

How can I achieve this?

CodePudding user response：

You can do everything with GroupBy.agg:

out = (df
 .groupby('Gene_name', as_index=False)
 .agg({'canonical': 'any',
       'mane': 'any',
       'metrics': lambda x: f'{x.min()}-{x.max()}',
       'Example_extra_col': 'first',
      })
 .replace({True: 'Yes', False: 'No'})
)

output:

  Gene_name canonical mane  metrics Example_extra_col
0      ACOT        No  Yes  1.4-1.7          col1ACOT
1       PAN       Yes   No  0.4-0.5           col1PAN
2       TOM       Yes   No  0.4-1.7           col1TOM

Alternative with names aggregation if you need to customize the output column names:

(df
 .groupby('Gene_name')
 .agg(**{'canonical': ('canonical', 'any'),
         'mane': ('mane', 'any'),
         'metrics': ('metrics', lambda x: f'{x.min()}-{x.max()}'),
         'Example_extra_col': ('Example_extra_col', 'first')
        })
)