Home > database >  Fill pandas dataframe column based on duplication of another column
Fill pandas dataframe column based on duplication of another column

Time:08-24

I have a dataframe with Country, IDs of the employees and a column that identify if the row is the primary row in case the employee ID appears more than once in the file. The Primary row column is all messed up (with values outside the "Yes" or "No" acceptance) as you can see in this example:

Country ID  Primary row
CZ  1   Yes
CZ  1   Yes
DK  3   No
DK  2   No
DK  2   
CZ  4   Yes
CZ  4   Yes
CZ  4   No
SK  5   No
SK  5   No
SK  5   No
SK  6   01-01-90
SK  6   
SK  7   
SK  7   
CZ  8   #

I would need a solution to do the following:

  • For the IDs appearing just once in the file, "Primary row" value should be always set to "Yes"

  • For the IDs appearing more than once in the file:

    • If all primary rows are blank, select at random one to be set as "Yes" and the others "No".
    • If the primary row has more than 1 "Yes", select at random one to keep as "Yes" and set the others to "No".
    • If the primary row has all "No", select at random one to be "Yes" and the others to remain "No".
    • If one of the primary row has a "No", fill a blank value to "Yes" and set the others to "No" if there is other blanks.

The output expected would be:

Country ID  Primary row
CZ  1   Yes
CZ  1   No
DK  3   Yes
DK  2   No
DK  2   Yes
CZ  4   Yes
CZ  4   No
CZ  4   No
SK  5   Yes
SK  5   No
SK  5   No
SK  6   Yes
SK  6   No
SK  7   Yes
SK  7   No
CZ  8   Yes

What would be the best way to achieve this result? Thank you!

CodePudding user response:

You should use a custom function with groupby.apply:

def fix(s):
    s2 = s[s.eq('Yes')]  # only "Yes"
    s3 = s[s.ne('No')]   # only "No"
    if len(s2): # if "Yes" are available, pick one to keep
        idx = s2.sample(n=1).index[0]
    elif len(s3): # if no "Yes", pick a non-"No"
        idx = s3.sample(n=1).index[0]
    else: # else, we only have "No", pick one
        idx = s.sample(n=1).index[0]
    out = pd.Series('No', index=s.index)
    out[idx] = 'Yes'
    return out

df['Primary row 2'] = df.groupby(['Country', 'ID'])['Primary row'].apply(fix)

output:

   Country  ID Primary row Primary row 2
0       CZ   1         Yes           Yes
1       CZ   1         Yes            No
2       DK   3          No           Yes
3       DK   2          No            No
4       DK   2        None           Yes
5       CZ   4         Yes           Yes
6       CZ   4         Yes            No
7       CZ   4          No            No
8       SK   5          No            No
9       SK   5          No           Yes
10      SK   5          No            No
11      SK   6    01-01-90            No
12      SK   6        None           Yes
13      SK   7        None            No
14      SK   7        None           Yes
15      CZ   8           #           Yes
  • Related