import pandas as pd
data_list = [['Name', 'Fruit'],
['Abel', 'Apple'],
['Abel', 'Pear'],
['Abel', 'Coconut'],
['Abel', 'Pear'],
['Benny', 'Apple'],
['Benny', 'Apple'],
['Cain', 'Apple'],
['Cain', 'Coconut'],
['Cain', 'Pear'],
['Cain', 'Lemon'],
['Cain', 'Orange']]
record_df = pd.DataFrame(data_list[1:], columns = data_list[0])
I am trying to create another DataFrame that tells me whether each person has the same fruit more than once.
Expected Output:
Name | Repeated_Fruits
Abel | 1
Benny| 1
Cain | 0
I have tried
bool_series = record_df.duplicated(subset=['Name'], keep=False)
record_df_2 = record_df[~bool_series]
But everything is True. What am I missing?
CodePudding user response:
You can use pd.crosstab:
out = pd.crosstab(record_df.Name, record_df.Fruit).gt(1).sum(axis=1).reset_index(name='rep_name')
Out[10]:
Name rep_name
0 Abel 1
1 Benny 1
2 Cain 0
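To see why the one-liner works, it may help to split it into steps. This is only a sketch built on the question's data; the intermediate names counts and repeated, and the column name Repeated_Fruits (taken from the expected output), are just for illustration.

```python
import pandas as pd

data_list = [['Name', 'Fruit'],
             ['Abel', 'Apple'], ['Abel', 'Pear'], ['Abel', 'Coconut'], ['Abel', 'Pear'],
             ['Benny', 'Apple'], ['Benny', 'Apple'],
             ['Cain', 'Apple'], ['Cain', 'Coconut'], ['Cain', 'Pear'],
             ['Cain', 'Lemon'], ['Cain', 'Orange']]
record_df = pd.DataFrame(data_list[1:], columns=data_list[0])

# One row per Name, one column per Fruit; each cell counts the occurrences.
counts = pd.crosstab(record_df['Name'], record_df['Fruit'])

# True where a person has the same fruit more than once.
repeated = counts.gt(1)

# Per person, the number of distinct fruits that are repeated.
out = repeated.sum(axis=1).reset_index(name='Repeated_Fruits')
print(out)
```

Note that this counts distinct repeated fruits per person, so someone with two Pears and two Apples would get 2, not 1.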
CodePudding user response:
The subset parameter is for saying where you're looking for duplicates, so it should be 'Fruit'. The Name column is what you want to group by, so you can do:
record_df.groupby('Name').apply(lambda x: x.duplicated().sum())
Name
Abel 1
Benny 1
Cain 0
dtype: int64
You say that you want 0 and 1 as output, but this will give larger numbers if a person has more than one duplicate.
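If the result should be strictly 0 or 1 per person, one possible approach is to flag the duplicated (Name, Fruit) rows first and then ask whether each person has any. A minimal sketch, assuming the question's data; the names is_dup and flags are made up for this example.

```python
import pandas as pd

data_list = [['Name', 'Fruit'],
             ['Abel', 'Apple'], ['Abel', 'Pear'], ['Abel', 'Coconut'], ['Abel', 'Pear'],
             ['Benny', 'Apple'], ['Benny', 'Apple'],
             ['Cain', 'Apple'], ['Cain', 'Coconut'], ['Cain', 'Pear'],
             ['Cain', 'Lemon'], ['Cain', 'Orange']]
record_df = pd.DataFrame(data_list[1:], columns=data_list[0])

# True for every row that repeats an earlier (Name, Fruit) pair.
is_dup = record_df.duplicated(subset=['Name', 'Fruit'])

# Per person: 1 if any row is such a repeat, otherwise 0.
flags = (is_dup.groupby(record_df['Name'])
               .any()
               .astype(int)
               .reset_index(name='Repeated_Fruits'))
print(flags)
```

Replacing .any().astype(int) with .sum() would give the duplicate count again, matching the groupby/apply version above.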