Pandas How do I create a list of duplicates from one column, and only keep the highest value for the

Time:09-14

I want to find all of the duplicates in the first column, Primary Mod Site, and keep only the highest value for each of the compounds (columns B-M) in the dataset (screenshot of the Excel sheet).

For code, I have:

import pandas as pd

# read the desired Excel file
df = pd.read_excel("20220825_CISLIB01_Plate-1_Rows-A-B")

# function to find the duplicates in the dataset and collect them;
# can be applied to any dataset with the same format as the original Excel files

def getDuplicate(df):
    # group on the column itself; the frame has no column named "gene"
    groups = df.groupby("Primary Mod Site")
    # collect every row whose Primary Mod Site occurs more than once
    return pd.concat(g for _, g in groups if len(g) > 1)

I'm stuck on what to do next. Help is much appreciated!

CodePudding user response:

It helps if you post the data as code or text so that others can reproduce the problem.

But, IIUC, you need to group by column 'A' (Primary Mod Site) and then take the max of the rest of the columns. This seems to do the trick:

df.groupby("Primary Mod Site").max()
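
To make the group-then-max idea concrete, here is a minimal sketch on made-up data. The compound column names (cmpd_1, cmpd_2) and all values are placeholders, not taken from the asker's sheet. It first lists the rows whose site is duplicated, then collapses each site to the column-wise maximum.

```python
import pandas as pd

# Made-up data; "cmpd_1" and "cmpd_2" are hypothetical stand-ins for columns B-M.
df = pd.DataFrame({
    "Primary Mod Site": ["A", "A", "B", "C", "C"],
    "cmpd_1": [1.0, 3.0, 2.0, 5.0, 4.0],
    "cmpd_2": [9.0, 2.0, 7.0, 1.0, 6.0],
})

# Every row whose site appears more than once (keep=False marks all copies).
duplicates = df[df.duplicated("Primary Mod Site", keep=False)]

# One row per site, holding the maximum of each compound column.
# as_index=False keeps "Primary Mod Site" as a regular column.
result = df.groupby("Primary Mod Site", as_index=False).max()
print(result)
```

Note that the maximum is taken per column, so the values in one output row may come from different input rows.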

CodePudding user response:

Based on what I noticed in the screenshot (the first three rows, for example), the row with the highest value in one column tends to have the highest value in all columns, so something like this might work:

df = df.sort_values("ONCV-1-1-1", ascending=False).drop_duplicates("Primary Mod Site", keep="first", ignore_index=True)

Or, if you are not sure that observation holds for all rows, this would probably work:

df = df.groupby("Primary Mod Site").max()
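
The two approaches differ when no single row is highest in every column. A small sketch with invented values (the columns x and y are hypothetical) shows the difference: sorting and dropping duplicates keeps one whole row, while the grouped max mixes values from several rows.

```python
import pandas as pd

# Invented data: the first row is highest in "y", the second in "x".
toy = pd.DataFrame({
    "Primary Mod Site": ["A", "A"],
    "x": [1.0, 3.0],
    "y": [9.0, 2.0],
})

# Keeps the entire row that has the largest "x" for each site.
by_sort = toy.sort_values("x", ascending=False).drop_duplicates(
    "Primary Mod Site", keep="first", ignore_index=True
)

# Takes the largest value of each column independently for each site.
by_group = toy.groupby("Primary Mod Site", as_index=False).max()

print(by_sort)   # x = 3.0, y = 2.0
print(by_group)  # x = 3.0, y = 9.0
```

If the goal is the highest value of every compound, the grouped max is the safer choice; the sort-based version is only equivalent when the assumption about the rows holds.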

NB: please post a reproducible example that is easy to copy and paste so we can test it.
