Home > Software design >  Filter out some group that don’t contain specific case of a column in python
Filter out some group that don’t contain specific case of a column in python

Time:11-03

I am new to python and I have a question about how to filter out some group (group by ‘date’ ) that don’t contain any female case. Suppose I have a dataframe as below:

import pandas as pd
import numpy as np

exam_data  = {
        'date': ['2000-1', '2000-1', '2000-1', '2000-2', '2000-2', '2000-3', '2000-4', '2000-4', '2000-5', '2000-5','2000-5'],
        'sex': ['M', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'M'],
        'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'no']}
df_exam = pd.DataFrame(exam_data)

I have tried this it seems correct but I feel like it is too specific as I used string contain 'F':

df_filter = df_exam.groupby("date").filter(lambda gr: gr["sex"].str.contains("F").any())

Are there any other algorithms more efficient than this or maybe more practice than this?

Thank you so much for your comments.

CodePudding user response:

For better performance never use filter, better is create boolean mask by GroupBy.transform with mask and filter in boolean indexing:

df_filter = df_exam[df_exam["sex"].eq("F").groupby(df_exam["date"]).transform('any')]

Or alternative with also good, depends of data if better like transform - get all dates by condition and filter in boolean indexing with Series.isin:

dates = df_exam.loc[df_exam["sex"].eq("F"), 'date'].unique()

df_filter = df_exam[df_exam["date"].isin(dates)]

CodePudding user response:

Tests are a definitive way to verify if an algorithm is more efficient; however, at a glance, if you can avoid the anonymous function(lambda), you should theoretically get more performance:

df_exam.loc[df_exam.sex.eq('F').groupby(df_exam.date).transform('any')]
Out[32]: 
      date sex qualify
0   2000-1   M     yes
1   2000-1   M      no
2   2000-1   F     yes
5   2000-3   F     yes
8   2000-5   F      no
9   2000-5   F     yes
10  2000-5   M      no

What the code above is grouping one series by another series (pandas index alignment kicks in here before the grouping); the transform keeps the result as the same length as the original dataframe

  • Related