I would like to know how I can take the value that appears several times in one column, which is linked to another column. Let me explain myself better with an example: I have a Dataframe with two columns: 'Provider' and 'PotentialFraud', in the 'Provider' column the same Provider code appears several times and in 'PotentialFraud' it is a Boolean field. I have to create another Dataframe in which in the 'Provider' column appears only once each Provider code, and in the 'PotentialFraud' field Appears True if in the first DataFrame with that code there are more True, otherwise False.
Example of the first DataFrame:
Provider | PotentialFraud |
---|---|
First | True |
Second | True |
First | True |
Second | False |
Third | False |
Second | True |
The second DataFrame will be:
Provider | PotentialFraud |
---|---|
First | True |
Second | True |
Third | False |
I tried to write this code:
final_provider_dataset = pd.DataFrame(columns=['Provider', 'PotentialFraud'])
for provider in provider_test_dataset['Provider'].unique():
potential_fraud_true = provider_test_dataset[provider_test_dataset['Provider'] == provider] ['PotentialFraud'].value_counts()[True]
potential_fraud_false = provider_test_dataset[provider_test_dataset['Provider'] == provider] ['PotentialFraud'].value_counts()[False]
if potential_fraud_true > potential_fraud_false:
final_provider_dataset = final_provider_dataset.append({'Provider': provider, 'PotentialFraud': True}, ignore_index=True)
else:
final_provider_dataset = final_provider_dataset.append({'Provider': provider, 'PotentialFraud': False}, ignore_index=True)
But this error comes out:
KeyError Traceback (most recent call last)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexes\base.py:3621, in Index.get_loc(self, key, method, tolerance)
3620 try:
-> 3621 return self._engine.get_loc(casted_key)
3622 except KeyError as err:
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\_libs\index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\_libs\index.pyx:144, in pandas._libs.index.IndexEngine.get_loc()
File pandas\_libs\index_class_helper.pxi:41, in pandas._libs.index.Int64Engine._check_type()
KeyError: True
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
*localdirectory*\main.ipynb Cella 121 in <cell line: 4>()
3 final_provider_dataset = pd.DataFrame(columns=['Provider', 'PotentialFraud'])
4 for provider in provider_test_dataset['Provider'].unique():
----> 5 potential_fraud_true = provider_test_dataset[provider_test_dataset['Provider'] == provider] ['PotentialFraud'].value_counts()[True]
6 potential_fraud_false = provider_test_dataset[provider_test_dataset['Provider'] == provider] ['PotentialFraud'].value_counts()[False]
7 if potential_fraud_true > potential_fraud_false:
...
3626 # InvalidIndexError. Otherwise we fall through and re-raise
3627 # the TypeError.
3628 self._check_indexing_error(key)
KeyError: True
What can I do? Thanks in advance.
CodePudding user response:
IIUC use Series.mode
for value with maximal count. If equal maximal count is selected first value by Series.iat
per groups:
f = lambda x: x.mode().iat[0]
df = df.groupby('Provider', as_index=False)['PotentialFraud'].agg(f)
CodePudding user response:
Compare the mean to 0.5, if greater than it, there is more True
:
df.groupby('Provider', as_index=False)['PotentialFraud'].agg(lambda x: x.mean() >= 0.5)
Output:
Provider PotentialFraud
0 First True
1 Second True
2 Third False