Finding duplicates in a DataFrame and returning 1s and 0s

import pandas as pd
data_list = [['Name', 'Fruit'],
              ['Abel', 'Apple'],
              ['Abel', 'Pear'],
              ['Abel', 'Coconut'],
              ['Abel', 'Pear'],
              ['Benny', 'Apple'],
              ['Benny', 'Apple'],
              ['Cain', 'Apple'],
              ['Cain', 'Coconut'],
              ['Cain', 'Pear'],
              ['Cain', 'Lemon'],
              ['Cain', 'Orange']]

record_df = pd.DataFrame(data_list[1:], columns = data_list[0])

I am trying to create another DataFrame that tells me whether each person has the same fruit more than once.

Expected Output:

Name | Repeated_Fruits
Abel | 1
Benny| 1
Cain | 0

I have tried

bool_series = record_df.duplicated(subset=['Name'], keep=False)
record_df_2 = record_df[~bool_series]

But every value in bool_series is True. What am I missing?

CodePudding user response:

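You can use pd.crosstab to count how often each fruit occurs per name, then count the fruits that occur more than once:

out = pd.crosstab(record_df.Name, record_df.Fruit).gt(1).sum(axis=1).reset_index(name='rep_name')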
Out[10]: 
    Name  rep_name
0   Abel         1
1  Benny         1
2   Cain         0
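
Note that .sum(axis=1) counts how many distinct fruits are repeated, so a name with two different repeated fruits would get 2. If a strict 0/1 flag is wanted, one possible variation is to use .any(axis=1) instead. This is only a sketch: it rebuilds the record_df from the question so it runs on its own, and it takes the column name Repeated_Fruits from the expected output.

```python
import pandas as pd

# Same data as in the question, rebuilt so the snippet is self-contained.
record_df = pd.DataFrame(
    [['Abel', 'Apple'], ['Abel', 'Pear'], ['Abel', 'Coconut'], ['Abel', 'Pear'],
     ['Benny', 'Apple'], ['Benny', 'Apple'],
     ['Cain', 'Apple'], ['Cain', 'Coconut'], ['Cain', 'Pear'],
     ['Cain', 'Lemon'], ['Cain', 'Orange']],
    columns=['Name', 'Fruit'])

# One row per name, one column per fruit, cells hold occurrence counts.
counts = pd.crosstab(record_df['Name'], record_df['Fruit'])

# 1 if any fruit occurs more than once for that name, otherwise 0.
flags = (counts.gt(1).any(axis=1).astype(int)
         .reset_index(name='Repeated_Fruits'))
print(flags)
```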

CodePudding user response:

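The subset parameter tells duplicated which columns to compare when looking for duplicates. With subset=['Name'] and keep=False, every row whose name occurs more than once is flagged, and every name in your data occurs more than once, so everything is True. The column to check is 'Fruit'; the Name column is what you want to group by, so you can do: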

record_df.groupby('Name').apply(lambda x: x.duplicated().sum())

Name
Abel     1
Benny    1
Cain     0
dtype: int64

You say that you want 0 and 1 as output, but this will give larger numbers if a person has more than one duplicate.
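
To cap the result at 1, one option is to flag repeated (Name, Fruit) rows across the whole frame and then ask, per name, whether any row was flagged. The following is a minimal sketch under the assumption that the data is the record_df from the question (rebuilt here so it runs standalone); the names is_repeat and result are made up for illustration.

```python
import pandas as pd

# Same data as in the question, rebuilt so the snippet is self-contained.
record_df = pd.DataFrame(
    [['Abel', 'Apple'], ['Abel', 'Pear'], ['Abel', 'Coconut'], ['Abel', 'Pear'],
     ['Benny', 'Apple'], ['Benny', 'Apple'],
     ['Cain', 'Apple'], ['Cain', 'Coconut'], ['Cain', 'Pear'],
     ['Cain', 'Lemon'], ['Cain', 'Orange']],
    columns=['Name', 'Fruit'])

# True for every row whose (Name, Fruit) pair already appeared in an earlier row.
is_repeat = record_df.duplicated(subset=['Name', 'Fruit'])

# Per name: 1 if at least one row is a repeat, otherwise 0.
result = (is_repeat.groupby(record_df['Name']).any().astype(int)
          .reset_index(name='Repeated_Fruits'))
print(result)
```

Because .any() collapses each group to a single boolean, the value stays at 1 no matter how many repeats a person has.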