Filter rows from a grouped data frame based on string columns-CodePudding

I have a data frame grouped by multiple columns but in this example it would be grouped only by Year.

   Year Animal1  Animal2
0  2002    Dog   Mouse,Lion
1  2002  Mouse            
2  2002   Lion            
3  2002   Duck            
4  2010    Dog   Cat
5  2010    Cat            
6  2010   Lion            
7  2010  Mouse

I would like for each group, from the rows where Animal2 is empty to filter out the rows where Animal2 does not appear in the column Animal1.

The expected output would be:

  Year Animal1   Animal2
0  2002    Dog   Mouse,Lion
1  2002  Mouse            
2  2002   Lion                   
3  2010    Dog   Cat
4  2010    Cat

Rows 0 & 3 stayed since Animal2 is not empty.

Rows 1 & 2 stayed since Mouse & Lion are in Animal2 for the first group.

Row 4 stayed since cat appear in Animal2 for the second group

CodePudding user response：

You can use masks and regexes:

# non empty Animal2
m1 = df['Animal2'].notna()

# make patterns with those Animals2 per Year
patterns = df[m1].groupby('Year')['Animal2'].agg('|'.join).str.replace(',', '|')

# for each Year select with the matching regex
m2 = (df.groupby('Year', group_keys=False)['Animal1']
        .apply(lambda g: g.str.fullmatch(patterns[g.name]))
     )

out = df.loc[m1|m2]

Or sets:

m1 = df['Animal2'].notna()

sets = (df.loc[m1, 'Animal2'].str.split(',')
          .groupby(df['Year'])
          .agg(lambda x: set().union(*x))
       )

m2 = (df.groupby('Year', group_keys=False)['Animal1']
        .apply(lambda g: g.isin(sets[g.name]))
     )

out = df.loc[m1|m2]

Output:

   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse        None
2  2002    Lion        None
4  2010     Dog         Cat
5  2010     Cat        None

CodePudding user response：

Here is a solution using list comprehension

(df.loc[
    [a1 in a2 for a1,a2 in zip(df['Animal1'],df['Year'].map(df['Animal2'].str.split(',').groupby(df['Year']).sum()))] | 
    df['Animal2'].notna()]
    )

d = df['Animal2'].str.split(',').groupby(df['Year']).sum()

(df.loc[df.groupby('Year')['Animal1'].transform(lambda x: x.isin(d.loc[x.name])) | 
df['Animal2'].notna()]
)

Output:

   Year Animal1     Animal2
0  2002     Dog  Mouse,Lion
1  2002   Mouse        None
2  2002    Lion        None
4  2010     Dog         Cat
5  2010     Cat        None