Finding duplicates in Dataframe and returning 1s and 0s

Time:02-13

import pandas as pd
data_list = [['Name', 'Fruit'],
              ['Abel', 'Apple'],
              ['Abel', 'Pear'],
              ['Abel', 'Coconut'],
              ['Abel', 'Pear'],
              ['Benny', 'Apple'],
              ['Benny', 'Apple'],
              ['Cain', 'Apple'],
              ['Cain', 'Coconut'],
              ['Cain', 'Pear'],
              ['Cain', 'Lemon'],
              ['Cain', 'Orange']]

record_df = pd.DataFrame(data_list[1:], columns = data_list[0])

I am trying to create another DataFrame to tell me if someone has the same fruit.

Expected Output:

Name | Repeated_Fruits
Abel | 1
Benny| 1
Cain | 0

I have tried

bool_series = record_df.duplicated(subset=['Name'], keep=False)
record_df_2 = record_df[~bool_series]

But everything is True. What am I missing?

CodePudding user response:

You can do pd.crosstab

out = pd.crosstab(record_df.Name, record_df.Fruit).gt(1).sum(axis=1).reset_index(name='rep_name')
Out[10]: 
    Name  rep_name
0   Abel         1
1  Benny         1
2   Cain         0
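To see why the crosstab one-liner works, it may help to split it into steps. The sketch below rebuilds the question's `record_df` and uses intermediate names (`counts`, `repeated`) that are only illustrative, not part of the original answer.

```python
import pandas as pd

# Rebuild the DataFrame from the question.
record_df = pd.DataFrame(
    [['Abel', 'Apple'], ['Abel', 'Pear'], ['Abel', 'Coconut'], ['Abel', 'Pear'],
     ['Benny', 'Apple'], ['Benny', 'Apple'],
     ['Cain', 'Apple'], ['Cain', 'Coconut'], ['Cain', 'Pear'],
     ['Cain', 'Lemon'], ['Cain', 'Orange']],
    columns=['Name', 'Fruit'],
)

# Step 1: one row per Name, one column per Fruit, cells hold occurrence counts.
counts = pd.crosstab(record_df['Name'], record_df['Fruit'])

# Step 2: True where a person has the same fruit more than once.
repeated = counts.gt(1)

# Step 3: count the repeated fruits per person and turn the index into a column.
out = repeated.sum(axis=1).reset_index(name='rep_name')
print(out)
```

Note that step 3 counts how many distinct fruits are repeated, so a person with two different repeated fruits would get 2 rather than 1.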

CodePudding user response:

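The subset parameter names the columns that are compared when looking for duplicates. With subset=['Name'] and keep=False, every row whose Name occurs more than once is flagged, and every name in your data occurs more than once, which is why everything is True. The column to check for duplicates is 'Fruit'; the Name column is what you want to group by, so you can do: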

record_df.groupby('Name').apply(lambda x: x.duplicated().sum())

Name
Abel     1
Benny    1
Cain     0
dtype: int64

You say that you want 0 and 1 as output, but this will give larger numbers if a person has more than one duplicate.
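If a strict 0/1 flag is needed, one possible approach is to mark duplicated (Name, Fruit) pairs first and then ask whether any row in each group is marked. This is a minimal sketch rather than the answerer's method; the column name `Repeated_Fruits` is taken from the expected output in the question, and `is_dup` and `result` are illustrative names.

```python
import pandas as pd

# Rebuild the DataFrame from the question.
record_df = pd.DataFrame(
    [['Abel', 'Apple'], ['Abel', 'Pear'], ['Abel', 'Coconut'], ['Abel', 'Pear'],
     ['Benny', 'Apple'], ['Benny', 'Apple'],
     ['Cain', 'Apple'], ['Cain', 'Coconut'], ['Cain', 'Pear'],
     ['Cain', 'Lemon'], ['Cain', 'Orange']],
    columns=['Name', 'Fruit'],
)

# True for every row that repeats an earlier (Name, Fruit) pair.
is_dup = record_df.duplicated(subset=['Name', 'Fruit'])

# Per person: 1 if at least one row is a repeat, else 0.
result = (
    is_dup.groupby(record_df['Name'])
    .any()
    .astype(int)
    .reset_index(name='Repeated_Fruits')
)
print(result)
```

Because `.any()` collapses each group to a single boolean, the result stays at 1 no matter how many repeats a person has.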
