Group a column and take the value that appears several times in another column in Pandas-CodePudding

I would like to know how I can take the value that appears several times in one column, which is linked to another column. Let me explain myself better with an example: I have a Dataframe with two columns: 'Provider' and 'PotentialFraud', in the 'Provider' column the same Provider code appears several times and in 'PotentialFraud' it is a Boolean field. I have to create another Dataframe in which in the 'Provider' column appears only once each Provider code, and in the 'PotentialFraud' field Appears True if in the first DataFrame with that code there are more True, otherwise False.

Example of the first DataFrame:

Provider	PotentialFraud
First	True
Second	True
First	True
Second	False
Third	False
Second	True

The second DataFrame will be:

Provider	PotentialFraud
First	True
Second	True
Third	False

I tried to write this code:

final_provider_dataset = pd.DataFrame(columns=['Provider', 'PotentialFraud'])
for provider in provider_test_dataset['Provider'].unique():
    potential_fraud_true = provider_test_dataset[provider_test_dataset['Provider'] == provider] ['PotentialFraud'].value_counts()[True]
    potential_fraud_false = provider_test_dataset[provider_test_dataset['Provider'] == provider] ['PotentialFraud'].value_counts()[False]
    if potential_fraud_true > potential_fraud_false:
        final_provider_dataset = final_provider_dataset.append({'Provider': provider, 'PotentialFraud': True}, ignore_index=True)
    else:
        final_provider_dataset = final_provider_dataset.append({'Provider': provider, 'PotentialFraud': False}, ignore_index=True)

But this error comes out:

KeyError                                  Traceback (most recent call last)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexes\base.py:3621, in Index.get_loc(self, key, method, tolerance)
   3620 try:
-> 3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\_libs\index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\_libs\index.pyx:144, in pandas._libs.index.IndexEngine.get_loc()

File pandas\_libs\index_class_helper.pxi:41, in pandas._libs.index.Int64Engine._check_type()

KeyError: True

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
*localdirectory*\main.ipynb Cella 121 in <cell line: 4>()
      3 final_provider_dataset = pd.DataFrame(columns=['Provider', 'PotentialFraud'])
      4 for provider in provider_test_dataset['Provider'].unique():
----> 5     potential_fraud_true = provider_test_dataset[provider_test_dataset['Provider'] == provider] ['PotentialFraud'].value_counts()[True]
      6     potential_fraud_false = provider_test_dataset[provider_test_dataset['Provider'] == provider] ['PotentialFraud'].value_counts()[False]
      7     if potential_fraud_true > potential_fraud_false:
...
   3626     #  InvalidIndexError. Otherwise we fall through and re-raise
   3627     #  the TypeError.
   3628     self._check_indexing_error(key)

KeyError: True

What can I do? Thanks in advance.

CodePudding user response：

IIUC use Series.mode for value with maximal count. If equal maximal count is selected first value by Series.iat per groups:

f = lambda x: x.mode().iat[0]
df = df.groupby('Provider', as_index=False)['PotentialFraud'].agg(f)

CodePudding user response：

Compare the mean to 0.5, if greater than it, there is more True:

df.groupby('Provider', as_index=False)['PotentialFraud'].agg(lambda x: x.mean() >= 0.5)

Output:

  Provider  PotentialFraud
0    First            True
1   Second            True
2    Third           False