How to sum rows containing specific targets in pandas?-CodePudding

Here is my code. I would like to sum the FPKM rows containing all specific target and print all the corresponding targets and sum values in a new pd.

# coding=utf-8
import pandas as pd
import numpy as np

classes = [('Carbon;Pyruvate;vitamins', 16.7, 1),
         ('Pyruvate;Carbohydrate;Pentose and glucuronate', 30, 7),
         ('Lipid;Carbon;Galactose', 40.5, 9),
         ('Galactose;Pyruvate;Fatty acid', 57, 10),
         ('Fatty acid;Lipid', 22, 4)]
labels = ['Ko_class','FPKM', 'count']
alls = pd.DataFrame.from_records(classes, columns=labels)


target = [['Carbon'],['Pyruvate'],['Galactose']]
targetsum = pd.DataFrame.from_records(target,columns=['target'])

#######          
targets = '|'.join(sum(target, []))
targetsum['total_FPKM']=(alls['FPKM']
                                    .groupby(alls['Ko_class']
                                            .str.contains(targets))
                                    .sum())

targetsum['count']=(alls['count']
                                    .groupby(alls['Ko_class']
                                            .str.contains(targets))
                                    .sum())

targetsum

Its results:

     target  total_FPKM  count
0     Carbon         NaN    NaN
1   Pyruvate         NaN    NaN
2  Galactose         NaN    NaN

What I want is :

       target  total_FPKM  count
0     Carbon         57.2     10
1   Pyruvate        103.7     18
2  Galactose         97.5     19

Hope I have described my question clearly:(

CodePudding user response：

try this :

def aggregation(dataframe,target):
    targetsum = pd.DataFrame(columns=['target','sum','count'])
    for val in target:
        df_tempo=dataframe.loc[dataframe['Ko_class'].str.contains(val),:].copy()
    
        new_row = {'target':val, 'sum':df_tempo['FPKM'].sum(), 'count':df_tempo['count'].sum()}
        targetsum = targetsum.append(new_row, ignore_index=True)
    return targetsum

df_result=aggregation(alls,['Carbon','Pyruvate','Galactose'])

Result :

target          sum     count
0   Carbon      57.2    10
1   Pyruvate    103.7   18
2   Galactose   97.5    19

CodePudding user response：

You can use str.findall to find the substances that appear in your 'Ko_class' column, and assign that back to a new column. Exploding this new list-valued column into a separate rows using explode will allow you to groupby on them and perform your aggregation:

target_list = ['Carbon','Pyruvate','Galactose']
target_substances = '|'.join(target_list)

alls.assign(
    Ko_class_contains_target = alls['Ko_class'].str.findall(target_substances)
    ).explode('Ko_class_contains_target').groupby('Ko_class_contains_target').agg('sum')

prints back:

                           FPKM  count
Ko_class_contains_target              
Carbon                     57.2     10
Galactose                  97.5     19
Pyruvate                  103.7     18