import pandas as pd
data_list = [['Name', 'Fruit'],
['Abel', 'Apple'],
['Abel', 'Pear'],
['Abel', 'Coconut'],
['Abel', 'Pear'],
['Benny', 'Apple'],
['Benny', 'Apple'],
['Cain', 'Apple'],
['Cain', 'Coconut'],
['Cain', 'Pear'],
['Cain', 'Lemon'],
['Cain', 'Orange']]
record_df = pd.DataFrame(data_list[1:], columns = data_list[0])
record_df
I am trying to create 2 extra data column in a pandas dataframe.
- Count of the highest duplicate.
- Count of fruits with the lowest duplicate.
So the end results would be
Name | Count of Highest Duplicate | Count of fruits with lowest duplicate
Abel | 2 | 2
Benny | 2 | 1
Cain | 1 | 5
I have tried
record_df.groupby('Name')['Fruit'].nunique().reset_index()
and
record_df.groupby('Name')['Fruit'].value_counts()
but am unsure where to proceed from here.
CodePudding user response:
Use value_counts
then groupby
:
out = df.value_counts(['Name', 'Fruit']).groupby(level='Name') \
.agg(**{'Count of Highest Duplicate': 'max',
'Count of fruits with lowest duplicate': lambda x: sum(x==x.min())})
print(out)
# Output
Count of Highest Duplicate Count of fruits with lowest duplicate
Name
Abel 2 2
Benny 2 1
Cain 1 5
CodePudding user response:
You can use crosstab
and standard operations:
df2 = pd.crosstab(record_df['Name'], record_df['Fruit'])
df2 = df2.where(df2.ne(0))
pd.DataFrame({'max': df2.max(axis=1),
'count(min)': df2.eq(df2.min(axis=1), axis=0).sum(axis=1)},
index=df2.index)
output:
max count(min)
Name
Abel 2 2
Benny 2 1
Cain 1 5