Take this table with two colums:
Fruit | First Occurence? |
---|---|
"Apple" | "Yes" |
"Orange" | "Yes" |
"Apple" | "Yes" |
"Apple" | "Yes" |
"Kiwi" | "Yes" |
"Banana" | "Yes" |
"Apple" | "Yes" |
"Orange" | "Yes" |
"Orange" | "Yes" |
"Banana" | "Yes" |
Made using this:
import pandas as pd
df = pd.DataFrame({
'Fruit': ["Apple", "Orange", "Apple", "Apple", "Kiwi", "Banana", "Apple", "Orange", "Orange", "Banana"],
'First Occurence?': ["Yes"]*10
})
How can I efficiently change values of the second column where the value in the first column has already appeared at least once? The desired result being this:
Fruit | First Occurence? |
---|---|
"Apple" | "Yes" |
"Orange" | "Yes" |
"Apple" | "No" |
"Apple" | "No" |
"Kiwi" | "Yes" |
"Banana" | "Yes" |
"Apple" | "No" |
"Orange" | "No" |
"Orange" | "No" |
"Banana" | "No" |
- It can't be a loop as this takes too long
- The rows must end up in the same order they started
Thanks in advance!
CodePudding user response:
you can use duplicated to identify the duplicate values and then using mask, assign a NO
# duplicated Fruits will come up as False, and First_occurrence be made to No
df['First Occurence?']= df['First Occurence?'].mask(df.duplicated('Fruit'),'No')
Fruit First Occurence?
0 Apple Yes
1 Orange Yes
2 Apple No
3 Apple No
4 Kiwi Yes
5 Banana Yes
6 Apple No
7 Orange No
8 Orange No
9 Banana No
CodePudding user response:
Use duplicated
and numpy.where
:
df['First Occurence?'] = np.where(df['Fruit'].duplicated(), 'No', 'Yes')
More generic alternative with groupby.cumcount
that enable to select any nth item (0-based):
df['First Occurence?'] = np.where(df.groupby('Fruit').cumcount().eq(0), 'Yes', 'No')
output:
Fruit First Occurence?
0 Apple Yes
1 Orange Yes
2 Apple No
3 Apple No
4 Kiwi Yes
5 Banana Yes
6 Apple No
7 Orange No
8 Orange No
9 Banana No