I'm working on the titanic dataset and figured that determining if a ticket is unique or not might be predictive. I've written some code below that tells us if the value is unique or not but I have a feeling that there is a much cleaner way of doing this.
import pandas as pd
dummy_data = pd.DataFrame({"Ticket" : ['A','A','B','B','C','D']
})
dummy_data
counts = dummy_data['Ticket'].value_counts()
a = counts[counts>1]
b=a.reset_index(name='Quantity')
b = b.rename(columns={'index': 'Tickets'})
b
dummy_data['Ticket_multiple'] = np.where(dummy_data['Ticket'].isin(b['Tickets']),1,0)
dummy_data.head(10)
CodePudding user response:
You can use pandas.Series.duplicated
with keep=False
:
Parameters: keep{‘first’, ‘last’, False}, default ‘first’ : Method to handle dropping duplicates:
- ‘first’ : Mark duplicates as True except for the first occurrence.
- ‘last’ : Mark duplicates as True except for the last occurrence.
- False : Mark all duplicates as True.
dummy_data['Ticket_multiple'] = dummy_data["Ticket"].duplicated(keep=False).astype(int)
Output :
print(dummy_data)
Ticket Ticket_multiple
0 A 1
1 A 1
2 B 1
3 B 1
4 C 0
5 D 0
Update :
If you need to consider a value duplicated only if it occurs (>N
) times, use this :
dummy_data = pd.DataFrame({"Ticket" : ['A','A','A','B','B','C','D']})
N = 2 #threshold
m1 = dummy_data.groupby("Ticket").transform("size").gt(N)
m2 = dummy_data["Ticket"].duplicated(keep=False)
dummy_data["Ticket_multiple (threshold)"] = (m1&m2).astype(int)
print(dummy_data)
Ticket Ticket_multiple Ticket_multiple (threshold)
0 A 1 1
1 A 1 1
2 A 1 1
3 B 1 0
4 B 1 0
5 C 0 0
6 D 0 0