How do I get the range in one dataframe column based on duplicate items in two other columns?-CodePudding

I have a dataframe that contains three columns: 'sequences', 'smiles' and 'labels'. Some of the rows have the same string entries in the 'sequences' and 'smiles' column, but a different float value in the 'labels' column. For duplicate sequences and smiles, I would like the get the range of values of the 'labels' column for those duplicate rows, which will be stored in a fourth column. I intend to reject rows, which have a range above a certain value.

I have made a dataframe that contains all the duplicate values:

duplicate_df = pd.concat(g for _, g in df.groupby(['sequence', 'smiles']) if len(g) > 1)

How do I get the range of the labels from the df?

Is there something like this I can do?

duplicate_df.groupby(['Target_sequence', 'processed_SMILES']).range()

My duplicate_df looks like this:

pd.DataFrame({'Label': {86468: 55700.0,
  86484: 55700.0,
  86508: 55700.0,
  124549: 55690.0,
  124588: 55690.0},
 'Target_sequence': {86468: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
  86484: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
  86508: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
  124549: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF',
  124588: 'AAPYLKTKFICVTPTTCSNTIDLPMSPRTLDSLMQFGNGEGAEPSAGGQF'},
 'processed_SMILES': {86468: 'CCOC(=O)[NH ]1CC[NH ](C(=O)c2ccc(-n3c(=S)[n-]c4ccccc4c3=O)cc2)CC1',
  86484: 'C[NH ]1CC[NH ](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3ccccc3F)cs2)CC1',
  86508: 'C[NH ]1CC[NH ](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3cccc([N ](=O)[O-])c3)cs2)CC1',
  124549: 'C[NH ]1CC[NH ](Cc2nc3ccccc3c(=O)n2Cc2nc(-c3cccc([N ](=O)[O-])c3)cs2)CC1',
  124588: 'CCOC(=O)[NH ]1CC[NH ](C(=O)c2ccc(-n3c(=S)[n-]c4ccccc4c3=O)cc2)CC1'}})

For example, duplicate rows where the items are the same I would like to have 0 in the 'range' column.

CodePudding user response：

std() is a valid aggregation function for group-by object. Therefore, after creating your df with the duplicated data, you can try:

duplicate_df.groupby(['Target_sequence', 'processed_SMILES'])['labels'].std()

Edit:

This is a nice opportunity to use pd.NamedAgg which was released in version 0.25:

df.groupby(['Target_sequence','processed_SMILES']).agg(Minimum = pd.NamedAgg(column='Label',aggfunc='min'),
                                                   Maximum = pd.NamedAgg(column='Label',aggfunc='max'))