I am new to Python and I have a question about how to filter out groups (grouped by 'date') that don't contain any female ('F') case. Suppose I have a DataFrame as below:
import pandas as pd
import numpy as np
exam_data = {
'date': ['2000-1', '2000-1', '2000-1', '2000-2', '2000-2', '2000-3', '2000-4', '2000-4', '2000-5', '2000-5','2000-5'],
'sex': ['M', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'M'],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'no']}
df_exam = pd.DataFrame(exam_data)
I have tried the following, and it seems correct, but it feels too specific since I used a string-contains check for 'F':
df_filter = df_exam.groupby("date").filter(lambda gr: gr["sex"].str.contains("F").any())
Are there any approaches that are more efficient or more idiomatic than this?
Thank you so much for your comments.
CodePudding user response:
For better performance, avoid filter; instead, create a boolean mask with GroupBy.transform and select the matching rows with boolean indexing:
df_filter = df_exam[df_exam["sex"].eq("F").groupby(df_exam["date"]).transform('any')]
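If you want to compare the two approaches yourself, here is a minimal timing sketch using the standard-library timeit module and the df_exam from the question (on this tiny sample the numbers are noisy; the gap only becomes meaningful on larger data with many groups):
import timeit

# mask + transform: no Python-level lambda called per group
t_transform = timeit.timeit(
    lambda: df_exam[df_exam["sex"].eq("F").groupby(df_exam["date"]).transform('any')],
    number=1000)

# groupby().filter(): calls the lambda once per group
t_filter = timeit.timeit(
    lambda: df_exam.groupby("date").filter(lambda gr: gr["sex"].str.contains("F").any()),
    number=1000)

print(t_transform, t_filter)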
An alternative that also performs well (whether it beats transform depends on the data): get all dates that meet the condition and filter with Series.isin in boolean indexing:
dates = df_exam.loc[df_exam["sex"].eq("F"), 'date'].unique()
df_filter = df_exam[df_exam["date"].isin(dates)]
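To confirm that both approaches select exactly the same rows, a quick sanity check (a small sketch; the mask_transform and mask_isin names are just for illustration):
# build both boolean masks over the same index
mask_transform = df_exam["sex"].eq("F").groupby(df_exam["date"]).transform('any')
dates = df_exam.loc[df_exam["sex"].eq("F"), 'date'].unique()
mask_isin = df_exam["date"].isin(dates)

# element-wise comparison: True means both masks pick the same rows
print((mask_transform == mask_isin).all())  # True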
CodePudding user response:
Timing tests are the definitive way to verify whether one approach is more efficient; however, at a glance, if you can avoid the anonymous function (lambda), you should theoretically get better performance:
df_exam.loc[df_exam.sex.eq('F').groupby(df_exam.date).transform('any')]
Out[32]:
date sex qualify
0 2000-1 M yes
1 2000-1 M no
2 2000-1 F yes
5 2000-3 F yes
8 2000-5 F no
9 2000-5 F yes
10 2000-5 M no
What the code above does is group one Series by another Series (pandas index alignment kicks in here before the grouping); the transform keeps the result the same length as the original DataFrame, so it can be used directly as a boolean mask.
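To see the alignment at work, you can print the intermediate mask on its own (using the question's df_exam): it is a boolean Series with the same index as df_exam, ready to be passed to [] or .loc:
# boolean mask: True for every row whose date group contains at least one 'F'
mask = df_exam.sex.eq('F').groupby(df_exam.date).transform('any')
print(mask)
# 0      True
# 1      True
# 2      True
# 3     False
# 4     False
# 5      True
# 6     False
# 7     False
# 8      True
# 9      True
# 10     True
# Name: sex, dtype: bool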