I have a dataframe with Country, IDs of the employees and a column that identify if the row is the primary row in case the employee ID appears more than once in the file. The Primary row column is all messed up (with values outside the "Yes" or "No" acceptance) as you can see in this example:
Country ID Primary row
CZ 1 Yes
CZ 1 Yes
DK 3 No
DK 2 No
DK 2
CZ 4 Yes
CZ 4 Yes
CZ 4 No
SK 5 No
SK 5 No
SK 5 No
SK 6 01-01-90
SK 6
SK 7
SK 7
CZ 8 #
I would need a solution to do the following:
For the IDs appearing just once in the file, "Primary row" value should be always set to "Yes"
For the IDs appearing more than once in the file:
- If all primary rows are blank, select at random one to be set as "Yes" and the others "No".
- If the primary row has more than 1 "Yes", select at random one to keep as "Yes" and set the others to "No".
- If the primary row has all "No", select at random one to be "Yes" and the others to remain "No".
- If one of the primary row has a "No", fill a blank value to "Yes" and set the others to "No" if there is other blanks.
The output expected would be:
Country ID Primary row
CZ 1 Yes
CZ 1 No
DK 3 Yes
DK 2 No
DK 2 Yes
CZ 4 Yes
CZ 4 No
CZ 4 No
SK 5 Yes
SK 5 No
SK 5 No
SK 6 Yes
SK 6 No
SK 7 Yes
SK 7 No
CZ 8 Yes
What would be the best way to achieve this result? Thank you!
CodePudding user response:
You should use a custom function with groupby.apply
:
def fix(s):
s2 = s[s.eq('Yes')] # only "Yes"
s3 = s[s.ne('No')] # only "No"
if len(s2): # if "Yes" are available, pick one to keep
idx = s2.sample(n=1).index[0]
elif len(s3): # if no "Yes", pick a non-"No"
idx = s3.sample(n=1).index[0]
else: # else, we only have "No", pick one
idx = s.sample(n=1).index[0]
out = pd.Series('No', index=s.index)
out[idx] = 'Yes'
return out
df['Primary row 2'] = df.groupby(['Country', 'ID'])['Primary row'].apply(fix)
output:
Country ID Primary row Primary row 2
0 CZ 1 Yes Yes
1 CZ 1 Yes No
2 DK 3 No Yes
3 DK 2 No No
4 DK 2 None Yes
5 CZ 4 Yes Yes
6 CZ 4 Yes No
7 CZ 4 No No
8 SK 5 No No
9 SK 5 No Yes
10 SK 5 No No
11 SK 6 01-01-90 No
12 SK 6 None Yes
13 SK 7 None No
14 SK 7 None Yes
15 CZ 8 # Yes