Duplicate analysis using conditions

Time:12-16

I am trying to analyse my duplicates before dropping them, rather than blindly keeping the first or the last result. I want to check the quality of the duplicates in order to decide whether to keep one value or remove both.

I want to put some conditions:

If the duplicates differ by less than 0.2 in the pIC50 column, I want to keep one row and set its value to the mean. If the difference is larger than that, I want to drop both rows.

import pandas as pd

hHDAC6_dup = pd.DataFrame(columns=["molecule_chembl_id", "pIC50"], data=[["CHEMBL407959", 6.468521], ["CHEMBL98", 6.795880], ["CHEMBL98", 7.721246], ["CHEMBL98", 7.75]])

if (hHDAC6_dup.molecule_chembl_id[1] == hHDAC6_dup.molecule_chembl_id[2]) and (abs(hHDAC6_dup.pIC50[1] - hHDAC6_dup.pIC50[2]) > 0.2):
    hHDAC6_dup.drop([1, 2], axis=0, inplace=True)
elif (hHDAC6_dup.molecule_chembl_id[1] == hHDAC6_dup.molecule_chembl_id[2]) and (abs(hHDAC6_dup.pIC50[1] - hHDAC6_dup.pIC50[2]) <= 0.2):
    hHDAC6_dup.pIC50[1] = (hHDAC6_dup.pIC50[1] + hHDAC6_dup.pIC50[2])/2
    hHDAC6_dup.drop(2, axis=0, inplace=True)
else:
    pass

I am getting the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3079             try:
-> 3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 1

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-72-b55b2493278b> in <module>
----> 1 if (hHDAC6_dup.molecule_chembl_id[1] == hHDAC6_dup.molecule_chembl_id[2]) and (abs(hHDAC6_dup.pIC50[1] - hHDAC6_dup.pIC50[2]) > 0.2):
      2     hHDAC6_dup.drop([1, 2], axis=0, inplace=True)
      3 elif (hHDAC6_dup.molecule_chembl_id[1] == hHDAC6_dup.molecule_chembl_id[2]) and (abs(hHDAC6_dup.pIC50[1] - hHDAC6_dup.pIC50[2]) <= 0.2):
      4     hHDAC6_dup.pIC50[1] = (hHDAC6_dup.pIC50[1] + hHDAC6_dup.pIC50[2])/2
      5     hHDAC6_dup.drop(2, axis=0, inplace=True)

~\Anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
    851 
    852         elif key_is_scalar:
--> 853             return self._get_value(key)
    854 
    855         if is_hashable(key):

~\Anaconda3\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
    959 
    960         # Similar to Index.get_value, but we do not fall back to positional
--> 961         loc = self.index.get_loc(label)
    962         return self.index._get_values_for_loc(self, loc, label)
    963 

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:
-> 3082                 raise KeyError(key) from err
   3083 
   3084         if tolerance is not None:

KeyError: 1

I need to apply this analysis across the whole df, not just to one pair of rows.

Thanks for the help.

CodePudding user response:

If you write your own aggregation function, you can use groupby to do this:

import numpy as np
import pandas as pd

def mean_if_close(seq, tolerance=0.2):
    span = np.max(seq) - np.min(seq)
    if span <= tolerance:
        return np.mean(seq)
    return np.nan

df = pd.DataFrame({'id': ['a', 'b', 'b', 'c', 'c', 'c'],
                   'value': [1.5, 3, 3.3, 4.1, 4.15, 4.2]})

df.groupby('id').agg(mean_if_close).dropna()
    value
id
a    1.50
c    4.15
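
To connect this back to the question, here is a minimal sketch that applies the same idea to the hHDAC6_dup frame from the question. It repeats mean_if_close so that it runs on its own. The name cleaned and the choice to select the pIC50 column before aggregating are my assumptions, not something the question specifies. Note that the helper compares the full spread (max minus min) of each group, so with three or more duplicates a single outlier is enough to drop the whole group.

```python
import numpy as np
import pandas as pd


def mean_if_close(seq, tolerance=0.2):
    # Same helper as above: mean when the spread is within tolerance, else NaN.
    span = np.max(seq) - np.min(seq)
    if span <= tolerance:
        return np.mean(seq)
    return np.nan


# Sample data copied from the question.
hHDAC6_dup = pd.DataFrame(
    columns=["molecule_chembl_id", "pIC50"],
    data=[["CHEMBL407959", 6.468521], ["CHEMBL98", 6.795880],
          ["CHEMBL98", 7.721246], ["CHEMBL98", 7.75]],
)

# One row per molecule: averaged if the duplicates agree, removed otherwise.
cleaned = (
    hHDAC6_dup.groupby("molecule_chembl_id")["pIC50"]
    .agg(mean_if_close)
    .dropna()
    .reset_index()  # turn the group key back into a regular column
)
print(cleaned)
```

With this sample, the three CHEMBL98 rows span roughly 0.95, so that molecule is dropped and only CHEMBL407959 remains. Because groupby never looks rows up by label, this also avoids the KeyError: 1 from the question, which is raised when no row in the index carries the label 1 (for example, if rows were filtered or dropped earlier).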