How to groupby certain column in a dataframe in pandas?

Time:08-11

I have the following dataframe containing genes, drugs, IDs and citations. I need rows with the same gene and the same drug to be merged into one row that keeps every citation for that drug when there is more than one. For example:

      Gene                          Drug                     ID     Cite
1  MAD1L1                       Lithium[17]           34718328     [17]
2    OAS1                       Lithium[17]           34718328     [17]
3    OAS1                       Lithium[7]            27401222      [7]

MAD1L1 has lithium and citation 17, but OAS1 has lithium and citation 17 and 7. I would like to concat the table into something similar to below:

      Gene                          Drug                     ID     Cite
1  MAD1L1                       Lithium[17]           34718328     [17]
2    OAS1                       Lithium[17][7]        34718328     [17]

OAS1 now has a single lithium row with both citations next to each other, and MAD1L1 is unchanged because it has only one citation for lithium.

CodePudding user response:

Here is one way to do it:

#use cite to group together the citations
df['cite2']=df.groupby('Gene')['Cite'].transform('sum')

#group by gene, and take the first result for each gene
df2=df.groupby('Gene').first()

#split the citation from the Drug name and append the cite2 (created above)
df2['Drug']=df2['Drug'].str.split('[', expand=True)[0] + df2['cite2']

# drop the temporary cite2 columns
df2.drop(columns='cite2', inplace=True)
df2.reset_index()
    Gene    Drug    ID  Cite
0   MAD1L1  Lithium[17]     34718328    [17]
1   OAS1    Lithium[17][7]  34718328    [17]
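
The steps above assume a `df` that is never shown being built. Below is a minimal self-contained sketch, assuming the sample frame can be rebuilt from the table in the question with `ID` as integers and the other columns as strings. It swaps `transform('sum')` for a lambda that joins the strings, which is a substitution made here rather than part of the original answer; both are meant to concatenate the citations within each gene.

```python
import pandas as pd

# Sample frame rebuilt from the table in the question (dtypes are an assumption).
df = pd.DataFrame({
    "Gene": ["MAD1L1", "OAS1", "OAS1"],
    "Drug": ["Lithium[17]", "Lithium[17]", "Lithium[7]"],
    "ID": [34718328, 34718328, 27401222],
    "Cite": ["[17]", "[17]", "[7]"],
})

# Concatenate every citation that belongs to the same gene.
df["cite2"] = df.groupby("Gene")["Cite"].transform(lambda s: "".join(s))

# Keep the first row of each gene.
df2 = df.groupby("Gene").first()

# Cut the bracketed citation off the drug name, then append all citations.
df2["Drug"] = df2["Drug"].str.split("[", expand=True)[0] + df2["cite2"]

# Drop the temporary column and turn Gene back into a regular column.
result = df2.drop(columns="cite2").reset_index()
print(result)
```

Note that this groups by `Gene` only, so a gene with several different drugs would have all of its rows collapsed into one.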

CodePudding user response:

Remove the citation from "Drug", then groupby.agg, either as 'first' or to join the strings. Then add back the citations:

out = (df
 .assign(Drug=df['Drug'].str.extract(r'(^[^\[\]]+)', expand=False))
 .groupby(['Gene', 'Drug'], as_index=False)
 .agg({'ID': 'first', 'Cite': ''.join})
 .assign(Drug=lambda d: d['Drug'] + d['Cite'])
)

Output:

     Gene            Drug        ID     Cite
0  MAD1L1     Lithium[17]  34718328     [17]
1    OAS1  Lithium[17][7]  34718328  [17][7]
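
The output above carries the joined citations in `Cite` as well, whereas the table requested in the question keeps only the first citation there. A possible variant, sketched under the assumption that the sample frame matches the question's table, uses named aggregation to keep both; `all_cites` is a hypothetical helper column introduced only for this illustration.

```python
import pandas as pd

# Sample frame rebuilt from the table in the question (dtypes are an assumption).
df = pd.DataFrame({
    "Gene": ["MAD1L1", "OAS1", "OAS1"],
    "Drug": ["Lithium[17]", "Lithium[17]", "Lithium[7]"],
    "ID": [34718328, 34718328, 27401222],
    "Cite": ["[17]", "[17]", "[7]"],
})

out = (
    df
    # Strip the bracketed citation so rows for the same drug share one key.
    .assign(Drug=df["Drug"].str.extract(r"^([^\[\]]+)", expand=False))
    .groupby(["Gene", "Drug"], as_index=False)
    .agg(
        ID=("ID", "first"),
        Cite=("Cite", "first"),        # keep only the first citation here
        all_cites=("Cite", "".join),   # hypothetical helper: every citation joined
    )
    # Append all citations to the bare drug name, then drop the helper.
    .assign(Drug=lambda d: d["Drug"] + d["all_cites"])
    .drop(columns="all_cites")
)
print(out)
```

Because the grouping key includes the bare drug name, different drugs for the same gene stay on separate rows.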