Reduce rows that has a repeated values in a column and summarises rest of column in different ways-CodePudding

I have a table like this

import pandas as pd

data = [
    ['ACOT', '00001', '', '', 1.5, 20, 30],
    ['ACOT', '00002', '', '', 1.7, 20, 33],
    ['ACOT', '00003', '','NA_0001' ,1.4, 20, 40],
    ['PAN', '000090', 'canonical', '', 0.5, 10, 30],
    ['PAN', '000091', '', '', 0.4, 10, 30],
    ['TOM', '000080', 'canonical', '', 0.4, 10, 15],
    ['TOM', '000040', '', '', 1.7, 10, 300]
]

df = pd.DataFrame(data, columns=[
    'Gene_name', 'Transcript_ID', 'canonical', 'mane', 'metrics','start','end'])

The output

Gene_name   Transcript_ID   canonical   mane    metrics start   end
0   ACOT    00001                               1.5 20  30
1   ACOT    00002                       NA_0001 1.7 20  33
2   ACOT    00003                               1.4 20  40
3   PAN     000090          canonical   NA_00090    0.5 10  30
4   PAN     000091                              0.4 10  30
5   TOM     000080          canonical           0.4 10  15
6   TOM     000040                              1.7 10  300

I want this output

Gene_name       canonical   mane    metrics
0   ACOT             No       Yes    1.4-1.5
4   PAN              Yes      Yes    0.5-0.4    
5   TOM              Yes       No    1.7-0.4

If mane/canonical have a value put yes otherwise put no. Create the range between the highest and the lowest value found in the metrics column.

CodePudding user response：

EDIT: Beside my solution works I would suggest to accept the answer of @965311532 because it is a very good combination of the "pandas-way" and "pythonic" code. Much less lines and much better to read.

Now my solution: I have to say your example code doesn't provide the exact output data. Look into mane column for example. And the metrics column in your desired output doesn't make sense because min-max and max-min are mixed.

My example code here produce this output as sample data:

  Gene_name  canonical     mane  metrics
0      ACOT                          1.5
1      ACOT                          1.7
2      ACOT             NA_0001      1.4
3       PAN  canonical               0.5
4       PAN                          0.4
5       TOM  canonical               0.4
6       TOM                          1.7

The result after transformation is

          canonical mane  metrics
Gene_name
ACOT             No   No  1.4-1.7
PAN             Yes   No  0.4-0.5
TOM             Yes   No  0.4-1.7

That is the solution code.

#!/usr/bin/env python3
import pandas as pd

data = [
    ['ACOT', '00001', '', '', 1.5, 20, 30],
    ['ACOT', '00002', '', '', 1.7, 20, 33],
    ['ACOT', '00003', '','NA_0001' ,1.4, 20, 40],
    ['PAN', '000090', 'canonical', '', 0.5, 10, 30],
    ['PAN', '000091', '', '', 0.4, 10, 30],
    ['TOM', '000080', 'canonical', '', 0.4, 10, 15],
    ['TOM', '000040', '', '', 1.7, 10, 300]
]

df = pd.DataFrame(data, columns=[
    'Gene_name', 'Transcript_ID', 'canonical', 'mane', 'metrics','start','end'])

# select only the columns we need
df = df.loc[:, ['Gene_name', 'canonical', 'mane', 'metrics']]

# helper function to used in apply()
def my_helper(group):
    # temporary store the result as a dict
    result = {}

    # check for existance of the values
    # it is easy because columnname, value and result column
    # have the same "name"
    for col in ['canonical', 'mane']:
        result[col] = \
            'Yes' if group[col].isin([col]).any() else 'No'


    # I use string format here. Could also be a tuple,
    # list, Series or what every you want.
    result['metrics'] = '{}-{}'.format(group.metrics.min(),
                                       group.metrics.max())

    return pd.Series(result)

desired = df.groupby('Gene_name').apply(my_helper)

The important part here is to use groupby() and then do the calculations and checks in a separate helper function.

Some details

About that line

def my_helper(group):
    result = {'Gene_name': group.Gene_name.iloc[0]}

The my_helper() function is called three times for each of your Gene_name. The group argument is a DataFrame. For example in the first call its content looks like this:

  Gene_name canonical     mane  metrics
0      ACOT                         1.5
1      ACOT                         1.7
2      ACOT            NA_0001      1.4

To collect the results I use a dict() here because the keys it are later used as column names for the resulting data frame.

About this line:

for col in ['canonical', 'mane']:
    result[col] = \
        'Yes' if group[col].isin([col]).any() else 'No'

This maybe looks a bit wired but it is IMHO more pythonic. It could be also written with some more lines like this.

if group['canonical'].isin(['canonical']).any():
    result['canonical'] = 'Yes'
else:
    result['canonical'] = 'No'
# and the same for 'mane'

The part .isin(['canonical']) gives you a list of boolean values. The .any() return True if there is minimal one True in that list and False if not.

About that line

return pd.Series(result)

Here a Series object is created out of the dict. From the first call of my_helper() the series looks like this:

Gene_name       ACOT
canonical         No
mane              No
metrics      1.4-1.7

It is a list where each item has its own name. The complete code will produce three of such series objects (for each Gene_name). The apply() does use each series as a row and glue them together to a new data frame.

CodePudding user response：

You can do this:

f = lambda x: "Yes" if x.any() else "No" # For canonical and mane
df = df.groupby('Gene_name').agg({'canonical': f, 'mane': f, 'metrics': ['min', 'max']})

# Rename columns
df.columns = ["canonical", "mane", "metrics_min", "metrics_max"]

Output:

            canonical   mane    metrics_min  metrics_max
Gene_name               
ACOT        No          Yes     1.4          1.7
PAN         Yes         Yes     0.4          0.5
TOM         Yes         No      0.4          1.7