Duplicate analysis using conditions

I am trying to analyze duplicates before dropping them all and keeping only the first or last result. I want to check the quality of the duplicates to decide whether to keep one value or remove both.

I want to apply the following conditions:

If the pIC50 values of the duplicates differ by less than 0.2, I want to keep one row and set its value to the mean. If the difference is greater than that, drop both rows.

import pandas as pd

hHDAC6_dup = pd.DataFrame(columns=["molecule_chembl_id", "pIC50"], data=[["CHEMBL407959", 6.468521], ["CHEMBL98", 6.795880], ["CHEMBL98", 7.721246], ["CHEMBL98", 7.75]])

if (hHDAC6_dup.molecule_chembl_id[1] == hHDAC6_dup.molecule_chembl_id[2]) and (abs(hHDAC6_dup.pIC50[1] - hHDAC6_dup.pIC50[2]) > 0.2):
    hHDAC6_dup.drop([1, 2], axis=0, inplace=True)
elif (hHDAC6_dup.molecule_chembl_id[1] == hHDAC6_dup.molecule_chembl_id[2]) and (abs(hHDAC6_dup.pIC50[1] - hHDAC6_dup.pIC50[2]) <= 0.2):
    hHDAC6_dup.pIC50[1] = (hHDAC6_dup.pIC50[1] + hHDAC6_dup.pIC50[2])/2
    hHDAC6_dup.drop(2, axis=0, inplace=True)
else:
    pass

I am getting the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3079             try:
-> 3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 1

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-72-b55b2493278b> in <module>
----> 1 if (hHDAC6_dup.molecule_chembl_id[1] == hHDAC6_dup.molecule_chembl_id[2]) and (abs(hHDAC6_dup.pIC50[1] - hHDAC6_dup.pIC50[2]) > 0.2):
      2     hHDAC6_dup.drop([1, 2], axis=0, inplace=True)
      3 elif (hHDAC6_dup.molecule_chembl_id[1] == hHDAC6_dup.molecule_chembl_id[2]) and (abs(hHDAC6_dup.pIC50[1] - hHDAC6_dup.pIC50[2]) <= 0.2):
      4     hHDAC6_dup.pIC50[1] = (hHDAC6_dup.pIC50[1] + hHDAC6_dup.pIC50[2])/2
      5     hHDAC6_dup.drop(2, axis=0, inplace=True)

~\Anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
    851 
    852         elif key_is_scalar:
--> 853             return self._get_value(key)
    854 
    855         if is_hashable(key):

~\Anaconda3\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
    959 
    960         # Similar to Index.get_value, but we do not fall back to positional
--> 961         loc = self.index.get_loc(label)
    962         return self.index._get_values_for_loc(self, loc, label)
    963 

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:
-> 3082                 raise KeyError(key) from err
   3083 
   3084         if tolerance is not None:

KeyError: 1

I need to apply this check to every set of duplicates in the whole DataFrame.

Thanks for the help.

User response:

If you write your own aggregation function, you can use groupby to do this:

import numpy as np
import pandas as pd

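def mean_if_close(seq, tolerance=0.2):
    # Spread of the group: difference between its largest and smallest value
    span = np.max(seq) - np.min(seq)
    if span <= tolerance:
        # Values agree closely enough: collapse the group to its mean
        return np.mean(seq)
    # Values disagree: mark the group with NaN so dropna() removes it
    return np.nan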

df = pd.DataFrame({'id': ['a', 'b', 'b', 'c', 'c', 'c'],
                   'value': [1.5, 3, 3.3, 4.1, 4.15, 4.2]})

df.groupby('id').agg(mean_if_close).dropna()

    value
id  
a   1.50
c   4.15
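
Applied to the frame from the question, the same idea might look like the sketch below. It assumes the column names from the question (`molecule_chembl_id`, `pIC50`) and reuses `mean_if_close` unchanged; the name `cleaned` is only illustrative. Note that the check is on the spread of the whole group (max minus min), so with three or more duplicates the group is dropped as soon as any two values differ by more than the tolerance.

```python
import numpy as np
import pandas as pd


def mean_if_close(seq, tolerance=0.2):
    # Mean of the group if its values lie within `tolerance`, else NaN
    span = np.max(seq) - np.min(seq)
    if span <= tolerance:
        return np.mean(seq)
    return np.nan


# Sample data copied from the question
hHDAC6_dup = pd.DataFrame(
    columns=["molecule_chembl_id", "pIC50"],
    data=[["CHEMBL407959", 6.468521], ["CHEMBL98", 6.795880],
          ["CHEMBL98", 7.721246], ["CHEMBL98", 7.75]],
)

# One row per molecule: the mean pIC50 where duplicates agree,
# and no row at all where they disagree
cleaned = (
    hHDAC6_dup.groupby("molecule_chembl_id")["pIC50"]
    .agg(mean_if_close)
    .dropna()
    .reset_index()
)
print(cleaned)
```

Molecules that appear only once have a span of 0, so they pass through with their original value. As for the `KeyError: 1` in the question: the traceback goes through `index.get_loc`, which means `hHDAC6_dup.pIC50[1]` is looking up the index label 1, not the second row. Presumably the real frame's index has no label 1 (for example after filtering rows), in which case `.iloc` would select by position instead.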