Create column that identify first date of another column pandas dataframe-CodePudding

Imagine I have a dataframe with employee IDs, their Hire Dates, the contract type (can be Employee or Contractor) and the company they where hired. Each employee may have as many rows for the same, or different companies and the same or different contract types.

ID      Hire Date   Contract Type   Company
10000   2000.01.01  Employee        Abc
10000   2001.01.01  Contractor      Zxc
10000   2000.01.01  Employee        Abc
10000   2000.01.01  Contractor      Abc
10000   2002.01.01  Employee        Cde
10000   2002.01.01  Employee        Abc
10001   1999.03.11  Employee        Zxc
10002   1989.01.01  Employee        Abc
10002   1989.01.01  Contractor      Cde
10002   1988.01.01  Contractor      Zxc
10002   1999.01.01  Employee        Abc

Per each ID, and each contract type they have, I need to identify the earliest hire date and assign that as their Primary Assignment (if the ID has 2 rows with the same contract type and same hire date in the same company just take the first value it appears and set to "Yes") for each unique Company they are hired, this way resulting on this dataframe:

ID      Hire Date   Contract Type   Company     Primary Assignment
10000   2000.01.01  Employee        Abc         Yes
10000   2001.01.01  Contractor      Zxc         Yes
10000   2000.01.01  Employee        Abc         No
10000   2000.01.01  Contractor      Abc         Yes
10000   2002.01.01  Employee        Cde         Yes
10000   2002.01.01  Employee        Abc         No
10001   1999.03.11  Employee        Zxc         Yes
10002   1989.01.01  Employee        Abc         Yes
10002   1989.01.01  Contractor      Cde         Yes
10002   1988.01.01  Contractor      Zxc         Yes
10002   1999.01.01  Employee        Abc         No

What would be the best way to achieve it?

CodePudding user response：

You can use groupby.idxmin:

date = pd.to_datetime(df['Hire Date'])
idx = date.groupby([df['ID'], df['Contract Type'], df['Company']]).idxmin()
df['Primary Assignment'] = np.where(df.index.isin(idx), 'Yes', 'No')

Alternatively, with groupby.transform and a mask if you want the flexibility to chose the value to keep:

date = pd.to_datetime(df['Hire Date'])
m1 = date.groupby([df['ID'], df['Contract Type'], df['Company']]).transform('min').eq(date)
m2 = ~df[['ID', 'Contract Type', 'Company']].duplicated()
df['Primary Assignment'] = np.where(m1&m2, 'Yes', 'No')

Output:

       ID   Hire Date Contract Type Company Primary Assignment
0   10000  2000.01.01      Employee     Abc                Yes
1   10000  2001.01.01    Contractor     Zxc                Yes
2   10000  2000.01.01      Employee     Abc                 No
3   10000  2000.01.01    Contractor     Abc                Yes
4   10000  2002.01.01      Employee     Cde                Yes
5   10000  2002.01.01      Employee     Abc                 No
6   10001  1999.03.11      Employee     Zxc                Yes
7   10002  1989.01.01      Employee     Abc                Yes
8   10002  1989.01.01    Contractor     Cde                Yes
9   10002  1988.01.01    Contractor     Zxc                Yes
10  10002  1999.01.01      Employee     Abc                 No