How to groupby certain column in a dataframe in pandas?-CodePudding

I have the following dataframe where I have different genes, drugs ID's and citations. I essentially need the same gene to be merged with the same drug but include both citations for that drug if it is to occur. For example below: pharmacogenomic

      Gene                          Drug                     ID     Cite
1  MAD1L1                       Lithium[17]           34718328     [17]
2    OAS1                       Lithium[17]           34718328     [17]
3    OAS1                       Lithium[7]            27401222      [7]

MAD1L1 has lithium and citation 17, but OAS1 has lithium and citation 17 and 7. I would like to concat the table into something similar to below:

      Gene                          Drug                     ID     Cite
1  MAD1L1                       Lithium[17]           34718328     [17]
2    OAS1                       Lithium[17][7]        34718328     [17]

OAS1 has lithium ,but both citation are next to eachother, and MAD1L1 is unchanged as it does not share the same citation for lithium as OAS1.

CodePudding user response：

here is one way to do it

#use cite to group together the citations
df['cite2']=df.groupby('Gene')['Cite'].transform('sum')

#group by gene, and take the first result for each gene
df2=df.groupby('Gene').first()

#split the citation from the Drug name and append the cite2 (created above)
df2['Drug']=df2['Drug'].str.split('[', expand=True)[0]   df2['cite2']

# drop the temporary cite2 columns
df2.drop(columns='cite2', inplace=True)
df2.reset_index()

    Gene    Drug    ID  Cite
0   MAD1L1  Lithium[17]     34718328    [17]
1   OAS1    Lithium[17][7]  34718328    [17]

CodePudding user response：

Remove the citation from "Drug", then groupby.agg, either as 'first' or to join the strings. Then add back the citations:

out = (df
 .assign(Drug=df['Drug'].str.extract(r'(^[^\[\]] )', expand=False))
 .groupby(['Gene', 'Drug'], as_index=False)
 .agg({'ID': 'first', 'Cite': ''.join})
 .assign(Drug=lambda d: d['Drug'] d['Cite'])
)

Output:

     Gene            Drug        ID     Cite
0  MAD1L1     Lithium[17]  34718328     [17]
1    OAS1  Lithium[17][7]  34718328  [17][7]