I am trying to analyze the duplicates before dropping them all and keeping only the first or the last result. I want to check the quality of the duplicates in order to decide whether to keep one value or remove both.
I want to apply the following conditions:
If the duplicates differ by less than 0.2 in the pIC50 column, I want to keep one row and set its value to the mean. If the difference is higher than that, drop both rows.
import pandas as pd

hHDAC6_dup = pd.DataFrame(columns=["molecule_chembl_id", "pIC50"], data=[["CHEMBL407959", 6.468521], ["CHEMBL98", 6.795880], ["CHEMBL98", 7.721246], ["CHEMBL98", 7.75]])
if (hHDAC6_dup.molecule_chembl_id[1] == hHDAC6_dup.molecule_chembl_id[2]) and (abs(hHDAC6_dup.pIC50[1] - hHDAC6_dup.pIC50[2]) > 0.2):
    hHDAC6_dup.drop([1, 2], axis=0, inplace=True)
elif (hHDAC6_dup.molecule_chembl_id[1] == hHDAC6_dup.molecule_chembl_id[2]) and (abs(hHDAC6_dup.pIC50[1] - hHDAC6_dup.pIC50[2]) <= 0.2):
    hHDAC6_dup.pIC50[1] = (hHDAC6_dup.pIC50[1] + hHDAC6_dup.pIC50[2])/2
    hHDAC6_dup.drop(2, axis=0, inplace=True)
else:
    pass
I am getting the following error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3079 try:
-> 3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 1
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-72-b55b2493278b> in <module>
----> 1 if (hHDAC6_dup.molecule_chembl_id[1] == hHDAC6_dup.molecule_chembl_id[2]) and (abs(hHDAC6_dup.pIC50[1] - hHDAC6_dup.pIC50[2]) > 0.2):
2 hHDAC6_dup.drop([1, 2], axis=0, inplace=True)
3 elif (hHDAC6_dup.molecule_chembl_id[1] == hHDAC6_dup.molecule_chembl_id[2]) and (abs(hHDAC6_dup.pIC50[1] - hHDAC6_dup.pIC50[2]) <= 0.2):
4 hHDAC6_dup.pIC50[1] = (hHDAC6_dup.pIC50[1] + hHDAC6_dup.pIC50[2])/2
5 hHDAC6_dup.drop(2, axis=0, inplace=True)
~\Anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
851
852 elif key_is_scalar:
--> 853 return self._get_value(key)
854
855 if is_hashable(key):
~\Anaconda3\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
959
960 # Similar to Index.get_value, but we do not fall back to positional
--> 961 loc = self.index.get_loc(label)
962 return self.index._get_values_for_loc(self, loc, label)
963
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
-> 3082 raise KeyError(key) from err
3083
3084 if tolerance is not None:
KeyError: 1
I would need to iterate over the whole df to do this analysis.
Thanks for the help.
CodePudding user response:
If you write your own aggregation function, you can use groupby to do this:
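import numpy as np
import pandas as pd

def mean_if_close(seq, tolerance=0.2):
    # Spread between the largest and smallest value in the group
    span = np.max(seq) - np.min(seq)
    if span <= tolerance:
        # Values agree closely enough: collapse them to their mean
        return np.mean(seq)
    # Values disagree: return NaN so the group can be dropped afterwards
    return np.nan

df = pd.DataFrame({'id': ['a', 'b', 'b', 'c', 'c', 'c'],
                   'value': [1.5, 3, 3.3, 4.1, 4.15, 4.2]})

# One row per id; groups whose values are too far apart are removed
df.groupby('id').agg(mean_if_close).dropna()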
    value
id
a    1.50
c    4.15
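Applied to the frame from the question, the same idea might look like the sketch below. It reuses mean_if_close from above and assumes the column names molecule_chembl_id and pIC50 shown in the question; the variable name result is only illustrative. Note that the whole group is judged by its max-min span, so an ID with three or more measurements is dropped entirely if any two of them are more than 0.2 apart, which may or may not be what you want.

```python
import numpy as np
import pandas as pd

def mean_if_close(seq, tolerance=0.2):
    # Mean of the group if its max-min span is within tolerance, else NaN
    span = np.max(seq) - np.min(seq)
    if span <= tolerance:
        return np.mean(seq)
    return np.nan

# Sample data copied from the question
hHDAC6_dup = pd.DataFrame(
    columns=["molecule_chembl_id", "pIC50"],
    data=[["CHEMBL407959", 6.468521], ["CHEMBL98", 6.795880],
          ["CHEMBL98", 7.721246], ["CHEMBL98", 7.75]],
)

# Group by molecule ID, aggregate only the pIC50 column,
# drop the groups that came back as NaN, and restore the ID as a column
result = (
    hHDAC6_dup.groupby("molecule_chembl_id")["pIC50"]
    .agg(mean_if_close)
    .dropna()
    .reset_index()
)
print(result)
```

With this sample only CHEMBL407959 remains, because the three CHEMBL98 values span almost a full pIC50 unit. Since groupby works on the values of the ID column rather than on index labels, it also avoids the KeyError, which is raised when the label 1 is not present in the frame's index (for example after filtering rows).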