Home > OS >  Adding new data based on maximum duplicates and number of minimum counts
Adding new data based on maximum duplicates and number of minimum counts

Time:02-12

import pandas as pd
data_list = [['Name', 'Fruit'],
              ['Abel', 'Apple'],
              ['Abel', 'Pear'],
              ['Abel', 'Coconut'],
              ['Abel', 'Pear'],
              ['Benny', 'Apple'],
              ['Benny', 'Apple'],
              ['Cain', 'Apple'],
              ['Cain', 'Coconut'],
              ['Cain', 'Pear'],
              ['Cain', 'Lemon'],
              ['Cain', 'Orange']]

record_df = pd.DataFrame(data_list[1:], columns = data_list[0])
record_df

I am trying to create 2 extra data column in a pandas dataframe.

  1. Count of the highest duplicate.
  2. Count of fruits with the lowest duplicate.

So the end results would be

Name | Count of Highest Duplicate | Count of fruits with lowest duplicate
Abel | 2 | 2
Benny | 2 | 1
Cain | 1 | 5

I have tried

record_df.groupby('Name')['Fruit'].nunique().reset_index()

and

record_df.groupby('Name')['Fruit'].value_counts()

but am unsure where to proceed from here.

CodePudding user response:

Use value_counts then groupby:

out = df.value_counts(['Name', 'Fruit']).groupby(level='Name') \
        .agg(**{'Count of Highest Duplicate': 'max',
                'Count of fruits with lowest duplicate': lambda x: sum(x==x.min())})
print(out)

# Output
       Count of Highest Duplicate  Count of fruits with lowest duplicate
Name                                                                    
Abel                            2                                      2
Benny                           2                                      1
Cain                            1                                      5

CodePudding user response:

You can use crosstab and standard operations:

df2 = pd.crosstab(record_df['Name'], record_df['Fruit'])
df2 = df2.where(df2.ne(0))

pd.DataFrame({'max': df2.max(axis=1),
              'count(min)': df2.eq(df2.min(axis=1), axis=0).sum(axis=1)},
             index=df2.index)

output:

       max  count(min)
Name                  
Abel     2           2
Benny    2           1
Cain     1           5
  • Related