Groupby specific column then assign new values base on conditions-CodePudding

I have a dataframe that looks like this below.

Name    M_Name       Tag        
John                  1
Dave      a           1
Mary                  1      
Sam                   1     
Chris     a           1
John                  2
Nola      f           2
Chuck                 2      
Rob                   2     
Chris     a           2
Angie                 3
Joe                   3
 :        :           : 
Tony                  n

I need to create a new column Tag_2, which is based on Tag column and M_Name column. First, I need to groupby Tag column then if M_Name column have more than 1 different letter, the whole Tag group will have a new value "Invalid" under Tag_2 column. If it only has 1 letter in the M-Name for the same Tag group then it will be "Valid". Here is what the output should look like.

Name    M_Name       Tag          Tag_2       
John                  1           Valid
Dave      a           1           Valid
Mary                  1           Valid           
Sam                   1           Valid     
Chris     a           1           Valid
John                  2          Invalid
Nola      f           2          Invalid          
Chuck                 2          Invalid      
Rob                   2          Invalid     
Chris     a           2          Invalid
Angie                 3           Valid
Joe                   3           Valid
 :        :           : 
Tony                  n           Valid

Tag 2 group is "Invalid" because the M_Name rows has more than 1 different letter (f & a) for the tag 2 group.

I am thinking to use groupby to solve this task but I do not know how to add a condition like this. df.groupby('Tag')['M_Name'] I am wondering am I on the right path using groupby or are ther any better approach that I should consider?

Thanks!

CodePudding user response：

I'm sure there's a nicer way, but here's one solution:

def validate(s):
    return "Invalid" if len(set(s)) > 2 else "Valid"


mapping = df.groupby("Tag")["M_Name"].apply(validate)
df["Tag_2"] = df.loc[:, "Tag"].replace(mapping)

Output:

     Name M_Name  Tag    Tag_2
0    John           1    Valid
1    Dave      a    1    Valid
2    Mary           1    Valid
3     Sam           1    Valid
4   Chris      a    1    Valid
5    John           2  Invalid
6    Nola      f    2  Invalid
7   Chuck           2  Invalid
8     Rob           2  Invalid
9   Chris      a    2  Invalid
10  Angie           3    Valid
11    Joe           3    Valid

CodePudding user response：

First, we group(), then we transform() the M_Name column, where in lambda we filter out empty values x[x.ne('')], then use value_counts() to get the elements frequency, and based on the number of elements, we determine the validity:

df = pd.DataFrame(
    {'Name': ['John', 'Dave', 'Mary', 'Sam', 'Chris', 'John', 'Nola', 'Chuck', 'Rob', 'Chris', 'Angie', 'Joe'],
     'M_Name': ['', 'a', '', '', 'a', '', 'f', '', '', 'a', '', ''], 'Tag': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3]})

df['Tag2'] = df.groupby('Tag').M_Name.transform(
    lambda x: 'Invalid' if x[x.ne('')].value_counts().shape[0] > 1 else 'Valid')
print(df)

Output:

     Name M_Name  Tag     Tag2
0    John           1    Valid
1    Dave      a    1    Valid
2    Mary           1    Valid
3     Sam           1    Valid
4   Chris      a    1    Valid
5    John           2  Invalid
6    Nola      f    2  Invalid
7   Chuck           2  Invalid
8     Rob           2  Invalid
9   Chris      a    2  Invalid
10  Angie           3    Valid
11    Joe           3    Valid