Add yes if within a group, a column contains all element from a list in python-CodePudding

I have a dataframe such as :

The_list=["A","B"]

Groups COL2
G1     A
G1     B
G1     C
G2     A
G2     D
G3     A
G3     B
G4     A
G5     B

and I would like to add another COL3 column where I put yes for each Groups where all element from The_list are present within the COL2, and no if not.

I should then get :

Groups COL2 COL3
G1     A    yes
G1     B    yes
G1     C    yes
G2     A    no
G2     D    no
G3     A    yes
G3     B    yes
G4     A    no
G5     B    no

CodePudding user response：

Use Groupby.apply with set intersection and np.where:

# Group on column 'Groups' and apply set intersection and find length of set
In [1178]: out = df.groupby('Groups')['COL2'].apply(lambda x: len(set(x) & set(The_list))).reset_index()

# If length of set is equal to the length of the list, then 'yes' else 'no'
In [1179]: import numpy as np
In [1180]: out['COL3'] = np.where(out.COL2.eq(len(The_list), 'yes', 'no')

# Merge original df with above and populate COL3
In [1185]: df = df.merge(out[['Groups','COL3']])

In [1186]: df
Out[1186]: 
  Groups COL2 COL3
0     G1    A  yes
1     G1    B  yes
2     G1    C  yes
3     G2    A   no
4     G2    D   no
5     G3    A  yes
6     G3    B  yes
7     G4    A   no
8     G5    B   no

CodePudding user response：

First make sure that Groups is the index. If not:

df = df.set_index("Groups")

Initialize COL3:

df["COL3"] = "yes"

Then make a list of the COL2 values for each group and change COL3 to "no" if an element from the list is missing:

for group in set(df.index):  
    COL2_values = list(df.loc[group, "COL2"])
    
    for element in The_list:
        if element not in COL2_values:
            df.loc[group, "COL3"] = 'no'
            break

CodePudding user response：

import pandas as pd

df = pd.read_clipboard()
df
'''
  Groups COL2
0     G1    A
1     G1    B
2     G1    C
3     G2    A
4     G2    D
5     G3    A
6     G3    B
7     G4    A
8     G5    B
'''

The_list=["A","B"]
tag = df.groupby('Groups').apply(lambda x: all([i in x.COL2.to_list() for i in The_list]))

tag
'''
Groups
G1     True
G2    False
G3     True
G4    False
G5    False
dtype: bool
'''

df.assign(COL3=df.Groups.map(tag).map({True:'yes', False:'no'}))
'''
  Groups COL2 COL3
0     G1    A  yes
1     G1    B  yes
2     G1    C  yes
3     G2    A   no
4     G2    D   no
5     G3    A  yes
6     G3    B  yes
7     G4    A   no
8     G5    B   no
'''