Set column as yes for first Start Date-CodePudding

I have the following dataframe:

ID  Job     Start Date
1   Driver  1951-01-01
1   Cleaner 2000-05-01
1   Staff   1951-01-01
2   Staff   2000-05-02
2   Staff2  2000-05-01
2   Cleaner 2000-04-01
5   Driver  1951-01-01

I need to create a column named "Primary?" that has the value "Yes" or "No" depending of the Start Date

For each ID, set the column to "Yes" for the earliest "Start Date" for the ID. If there is 2 rows tied with same "Start Date" pick one to set to "Yes" (can be the first one that appears)
All other rows set to "No" for the ID

In that case, each ID would have a row with a "Yes" (just 1 single row) and all other rows to "No" resulting in this dataframe:

ID  Job     Start Date  Primary?
1   Driver  1951-01-01  Yes
1   Cleaner 2000-05-01  No
1   Staff   1951-01-01  No
2   Staff   2000-05-02  No
2   Staff2  2000-05-01  Yes
2   Cleaner 2000-04-01  Yes
5   Driver  1951-01-01  Yes

What is the best way to do it?

CodePudding user response：

You can use transform to get the first date that appears per id, and then use np.where:

df['Primary'] =  np.where(df['Start Date'] == df.groupby('ID')['Start Date'].\
                             transform('first'), 'Yes', 'No')

or, you can set transform('min') if you want to convert the minimum date to Yes.

CodePudding user response：

# set the primary start-date row as True/False, when its a min for the ID

df['Primary']= df['Start_Date'].eq(df.groupby(['ID'])['Start_Date'].transform(min))

# identify the duplicates start-dates
df.loc[df.duplicated(subset=['ID','Primary'], keep='first'),'Primary' ] = False
df
# Map True/False to Yes/No
df['Primary']=df['Primary'].map({True: 'Yes', False: 'No'})
df

 ---- ----- ---------- ------------- --------- 
|    | ID  |   Job    | Start_Date  | Primary |
 ---- ----- ---------- ------------- --------- 
| 0  |  1  | Driver   | 1951-01-01  | Yes     |
| 1  |  1  | Cleaner  | 2000-05-01  | No      |
| 2  |  1  | Staff    | 1951-01-01  | No      |
| 3  |  2  | Staff    | 2000-05-02  | No      |
| 4  |  2  | Staff2   | 2000-05-01  | No      |
| 5  |  2  | Cleaner  | 2000-04-01  | Yes     |
| 6  |  5  | Driver   | 1951-01-01  | Yes     |
 ---- ----- ---------- ------------- ---------

CodePudding user response：

This solution handles works:

(df
 .assign(primary=lambda x: x.groupby("ID")["Start"].transform("min"))
 .assign(keep=lambda x: x.groupby("ID")["Start"].transform(lambda x: x.duplicated("first")))
 .assign(primary=lambda x: np.select([x.Start == x.primary],
                                     ["Yes"],
                                     default="No")
       )
 .assign(primary=lambda x: np.select([(x.primary == "Yes") & ~(x.keep)],
                                     ["Yes"],
                                     default="No")
       )
)