Home > Back-end >  Set column as yes for first Start Date
Set column as yes for first Start Date

Time:09-23

I have the following dataframe:

ID  Job     Start Date
1   Driver  1951-01-01
1   Cleaner 2000-05-01
1   Staff   1951-01-01
2   Staff   2000-05-02
2   Staff2  2000-05-01
2   Cleaner 2000-04-01
5   Driver  1951-01-01

I need to create a column named "Primary?" that has the value "Yes" or "No" depending of the Start Date

  • For each ID, set the column to "Yes" for the earliest "Start Date" for the ID. If there is 2 rows tied with same "Start Date" pick one to set to "Yes" (can be the first one that appears)

  • All other rows set to "No" for the ID

In that case, each ID would have a row with a "Yes" (just 1 single row) and all other rows to "No" resulting in this dataframe:

ID  Job     Start Date  Primary?
1   Driver  1951-01-01  Yes
1   Cleaner 2000-05-01  No
1   Staff   1951-01-01  No
2   Staff   2000-05-02  No
2   Staff2  2000-05-01  Yes
2   Cleaner 2000-04-01  Yes
5   Driver  1951-01-01  Yes

What is the best way to do it?

CodePudding user response:

You can use transform to get the first date that appears per id, and then use np.where:

df['Primary'] =  np.where(df['Start Date'] == df.groupby('ID')['Start Date'].\
                             transform('first'), 'Yes', 'No')

or, you can set transform('min') if you want to convert the minimum date to Yes.

CodePudding user response:

# set the primary start-date row as True/False, when its a min for the ID

df['Primary']= df['Start_Date'].eq(df.groupby(['ID'])['Start_Date'].transform(min))

# identify the duplicates start-dates
df.loc[df.duplicated(subset=['ID','Primary'], keep='first'),'Primary' ] = False
df
# Map True/False to Yes/No
df['Primary']=df['Primary'].map({True: 'Yes', False: 'No'})
df

 ---- ----- ---------- ------------- --------- 
|    | ID  |   Job    | Start_Date  | Primary |
 ---- ----- ---------- ------------- --------- 
| 0  |  1  | Driver   | 1951-01-01  | Yes     |
| 1  |  1  | Cleaner  | 2000-05-01  | No      |
| 2  |  1  | Staff    | 1951-01-01  | No      |
| 3  |  2  | Staff    | 2000-05-02  | No      |
| 4  |  2  | Staff2   | 2000-05-01  | No      |
| 5  |  2  | Cleaner  | 2000-04-01  | Yes     |
| 6  |  5  | Driver   | 1951-01-01  | Yes     |
 ---- ----- ---------- ------------- --------- 

CodePudding user response:

This solution handles works:

(df
 .assign(primary=lambda x: x.groupby("ID")["Start"].transform("min"))
 .assign(keep=lambda x: x.groupby("ID")["Start"].transform(lambda x: x.duplicated("first")))
 .assign(primary=lambda x: np.select([x.Start == x.primary],
                                     ["Yes"],
                                     default="No")
       )
 .assign(primary=lambda x: np.select([(x.primary == "Yes") & ~(x.keep)],
                                     ["Yes"],
                                     default="No")
       )
)
  • Related