I am new to Python and I have a question about how to filter out groups (grouped by 'date') that don't contain any female ('F') case. Suppose I have a DataFrame as below:
import pandas as pd
import numpy as np
exam_data = {
'date': ['2000-1', '2000-1', '2000-1', '2000-2', '2000-2', '2000-3', '2000-4', '2000-4', '2000-5', '2000-5','2000-5'],
'sex': ['M', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'M'],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'no']}
df_exam = pd.DataFrame(exam_data)
I have tried the following, and it seems correct, but it feels too specific since I used a string-contains check for 'F':
df_filter = df_exam.groupby("date").filter(lambda gr: gr["sex"].str.contains("F").any())
Are there any approaches that are more efficient or more idiomatic than this?
Thank you so much for your comments.
CodePudding user response:
For better performance, avoid filter; instead, create a boolean mask with GroupBy.transform and select the matching rows with boolean indexing:
df_filter = df_exam[df_exam["sex"].eq("F").groupby(df_exam["date"]).transform('any')]
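If you want to compare the two approaches yourself, here is a minimal timing sketch using the standard-library timeit module and the df_exam from the question (on this tiny sample the numbers are noisy; the gap only becomes meaningful on larger data with many groups):
import timeit

# mask + transform: no Python-level lambda called per group
t_transform = timeit.timeit(
    lambda: df_exam[df_exam["sex"].eq("F").groupby(df_exam["date"]).transform('any')],
    number=1000)

# groupby().filter(): calls the lambda once per group
t_filter = timeit.timeit(
    lambda: df_exam.groupby("date").filter(lambda gr: gr["sex"].str.contains("F").any()),
    number=1000)

print(t_transform, t_filter)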
An alternative that also performs well (whether it beats transform depends on the data): get all dates that meet the condition and filter with Series.isin in boolean indexing:
dates = df_exam.loc[df_exam["sex"].eq("F"), 'date'].unique()
df_filter = df_exam[df_exam["date"].isin(dates)]
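To confirm that both approaches select exactly the same rows, a quick sanity check (a small sketch; the mask_transform and mask_isin names are just for illustration):
# build both boolean masks over the same index
mask_transform = df_exam["sex"].eq("F").groupby(df_exam["date"]).transform('any')
dates = df_exam.loc[df_exam["sex"].eq("F"), 'date'].unique()
mask_isin = df_exam["date"].isin(dates)

# element-wise comparison: True means both masks pick the same rows
print((mask_transform == mask_isin).all())  # True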
CodePudding user response:
Timing tests are the definitive way to verify whether one approach is more efficient; however, at a glance, if you can avoid the anonymous function (lambda), you should theoretically get better performance:
df_exam.loc[df_exam.sex.eq('F').groupby(df_exam.date).transform('any')]
Out[32]:
date sex qualify
0 2000-1 M yes
1 2000-1 M no
2 2000-1 F yes
5 2000-3 F yes
8 2000-5 F no
9 2000-5 F yes
10 2000-5 M no
What the code above does is group one Series by another Series (pandas index alignment kicks in here before the grouping); the transform keeps the result the same length as the original DataFrame, so it can be used directly as a boolean mask.
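To see the alignment at work, you can print the intermediate mask on its own (using the question's df_exam): it is a boolean Series with the same index as df_exam, ready to be passed to [] or .loc:
# boolean mask: True for every row whose date group contains at least one 'F'
mask = df_exam.sex.eq('F').groupby(df_exam.date).transform('any')
print(mask)
# 0      True
# 1      True
# 2      True
# 3     False
# 4     False
# 5      True
# 6     False
# 7     False
# 8      True
# 9      True
# 10     True
# Name: sex, dtype: bool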