Imagine I have a dataframe with employee IDs, their Contract Number, and the Company they work for. Each employee can have as many contracts as they want for the same company or even for different companies:
ID Contract Number Company
10000 1 Abc
10000 2 Zxc
10000 3 Abc
10001 1 Zxc
10002 2 Abc
10002 1 Cde
10002 3 Zxc
I need to find a way to identify the company of the contract number "1" per each ID and then create a column "Primary Contract" that would be set to "Yes" if the contract is in the same company as the company of contract number 1 resulting on this dataframe:
ID Contract Number Company Primary Compay
10000 1 Abc Yes
10000 2 Zxc No
10000 3 Abc Yes
10001 1 Zxc Yes
10002 2 Abc No
10002 1 Cde Yes
10002 3 Zxc No
What would be the best way to achieve it?
CodePudding user response:
You can use groupby.apply
with isin
and numpy.where
:
df['Primary Company'] = np.where(
df.groupby('ID', group_keys=False)
.apply(lambda g: g['Company'].isin(g.loc[g['Contract Number'].eq(1), 'Company'])
),
'Yes', 'No'
)
Output:
ID Contract Number Company Primary Company
0 10000 1 Abc Yes
1 10000 2 Zxc No
2 10000 3 Abc Yes
3 10001 1 Zxc Yes
4 10002 2 Abc No
5 10002 1 Cde Yes
6 10002 3 Zxc No
If you can just use a boolean (True
/False
) instead of 'Yes'
/'No'
:
df['Primary Company'] = (
df.groupby('ID', group_keys=False)
.apply(lambda g: g['Company'].isin(g.loc[g['Contract Number'].eq(1), 'Company']))
)
CodePudding user response:
Filter rows with Contract Number
is 1
, use left join in DataFrame.merge
and compare _merge
column generated by indicator=True
parameter:
mask = (df.merge(df[df['Contract Number'].eq(1)],
how='left', on=['ID','Company'], indicator=True)['_merge'].eq('both'))
df['Primary Company'] = np.where(mask, 'Yes','No')
print (df)
ID Contract Number Company Primary Company
0 10000 1 Abc Yes
1 10000 2 Zxc No
2 10000 3 Abc Yes
3 10001 1 Zxc Yes
4 10002 2 Abc No
5 10002 1 Cde Yes
6 10002 3 Zxc No
Another idea is with compare MultiIndex
by Index.isin
:
idx = df[df['Contract Number'].eq(1)].set_index(['ID','Company']).index
df['Primary Company'] = np.where(df.set_index(['ID','Company']).index.isin(idx),
'Yes','No')
print (df)
ID Contract Number Company Primary Company
0 10000 1 Abc Yes
1 10000 2 Zxc No
2 10000 3 Abc Yes
3 10001 1 Zxc Yes
4 10002 2 Abc No
5 10002 1 Cde Yes
6 10002 3 Zxc No