I have the following dataframe:
ID Job Start Date
1 Driver 1951-01-01
1 Cleaner 2000-05-01
1 Staff 1951-01-01
2 Staff 2000-05-02
2 Staff2 2000-05-01
2 Cleaner 2000-04-01
5 Driver 1951-01-01
I need to create a column named "Primary?" that has the value "Yes" or "No" depending of the Start Date
For each ID, set the column to "Yes" for the earliest "Start Date" for the ID. If there is 2 rows tied with same "Start Date" pick one to set to "Yes" (can be the first one that appears)
All other rows set to "No" for the ID
In that case, each ID would have a row with a "Yes" (just 1 single row) and all other rows to "No" resulting in this dataframe:
ID Job Start Date Primary?
1 Driver 1951-01-01 Yes
1 Cleaner 2000-05-01 No
1 Staff 1951-01-01 No
2 Staff 2000-05-02 No
2 Staff2 2000-05-01 Yes
2 Cleaner 2000-04-01 Yes
5 Driver 1951-01-01 Yes
What is the best way to do it?
CodePudding user response:
You can use transform
to get the first date that appears per id, and then use np.where
:
df['Primary'] = np.where(df['Start Date'] == df.groupby('ID')['Start Date'].\
transform('first'), 'Yes', 'No')
or, you can set transform('min')
if you want to convert the minimum date to Yes
.
CodePudding user response:
# set the primary start-date row as True/False, when its a min for the ID
df['Primary']= df['Start_Date'].eq(df.groupby(['ID'])['Start_Date'].transform(min))
# identify the duplicates start-dates
df.loc[df.duplicated(subset=['ID','Primary'], keep='first'),'Primary' ] = False
df
# Map True/False to Yes/No
df['Primary']=df['Primary'].map({True: 'Yes', False: 'No'})
df
---- ----- ---------- ------------- ---------
| | ID | Job | Start_Date | Primary |
---- ----- ---------- ------------- ---------
| 0 | 1 | Driver | 1951-01-01 | Yes |
| 1 | 1 | Cleaner | 2000-05-01 | No |
| 2 | 1 | Staff | 1951-01-01 | No |
| 3 | 2 | Staff | 2000-05-02 | No |
| 4 | 2 | Staff2 | 2000-05-01 | No |
| 5 | 2 | Cleaner | 2000-04-01 | Yes |
| 6 | 5 | Driver | 1951-01-01 | Yes |
---- ----- ---------- ------------- ---------
CodePudding user response:
This solution handles works:
(df
.assign(primary=lambda x: x.groupby("ID")["Start"].transform("min"))
.assign(keep=lambda x: x.groupby("ID")["Start"].transform(lambda x: x.duplicated("first")))
.assign(primary=lambda x: np.select([x.Start == x.primary],
["Yes"],
default="No")
)
.assign(primary=lambda x: np.select([(x.primary == "Yes") & ~(x.keep)],
["Yes"],
default="No")
)
)