How to filter out multiple rows in a pandas.DataFrame based on multiple conditions for the same column?


I have an example pd.DataFrame containing the codenames of software projects developed in different studios:

import pandas as pd

df = pd.DataFrame({
    'project_id': [36423, 28564, 96648, 96648, 10042, 68277, 68277, 68277],
    'codename': ['banana', 'apple', 'peach', 'peach', 'melon', 'pear', 'pear', 'pear'],
    'studio': ['paris', 'amsterdam', 'frankfurt', 'paris', 'london', 'brussel', 'amsterdam', 'sofia']
})

   project_id codename     studio
0       36423   banana      paris
1       28564    apple  amsterdam
2       96648    peach  frankfurt
3       96648    peach      paris
4       10042    melon     london
5       68277     pear    brussel
6       68277     pear  amsterdam
7       68277     pear      sofia

What would be the best way to filter out the rows that hold projects developed

  1. in at least two different studios?
  2. in two specific studios?

The results I am trying to achieve look as follows:

Which projects are being developed in at least two different studios?

   project_id codename     studio
0       96648    peach  frankfurt
1       96648    peach      paris
2       68277     pear    brussel
3       68277     pear  amsterdam
4       68277     pear      sofia

Which projects are being developed in frankfurt AND paris?

   project_id codename     studio
0       96648    peach  frankfurt
1       96648    peach      paris

Using df.loc[df['studio'].isin(['frankfurt', 'paris'])], for instance, does not work, as it selects every row whose studio is either frankfurt or paris, regardless of how many studios the project spans. Is there a more elegant way than filtering the dataframe for frankfurt and for paris separately and intersecting the results (e.g. with Index.intersection())? I am running out of ideas here.
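For illustration (my reconstruction, not part of the original post), this is what that attempt returns on the example frame above; it also picks up the single-studio banana project:

print(df.loc[df['studio'].isin(['frankfurt', 'paris'])])

   project_id codename     studio
0       36423   banana      paris
2       96648    peach  frankfurt
3       96648    peach      paris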

Thanks in advance! :)

CodePudding user response:

For the first question:

# keep rows whose project_id is associated with at least 2 distinct studios
df[df.groupby('project_id')['studio'].transform('nunique').ge(2)]

output:

   project_id codename     studio
2       96648    peach  frankfurt
3       96648    peach      paris
5       68277     pear    brussel
6       68277     pear  amsterdam
7       68277     pear      sofia
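An equivalent way to express the same selection, sketched here as an addition rather than taken from the answer above, is groupby().filter(), which keeps or drops whole project groups:

# keep whole groups whose studios contain at least 2 unique values
multi_studio = df.groupby('project_id').filter(lambda g: g['studio'].nunique() >= 2)
print(multi_studio)   # same five rows as above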

For the second:

# keep rows whose project was developed in exactly frankfurt and paris
df[df.groupby('project_id')['studio']
     .transform(lambda x: set(x) == {'frankfurt', 'paris'})]
# if you want at least frankfurt and paris (possibly more studios), use
# set(x) >= {'frankfurt', 'paris'}

output:

   project_id codename     studio
2       96648    peach  frankfurt
3       96648    peach      paris
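Spelling out the variant from the comment above as a complete snippet (on this particular data it returns the same two peach rows, since no project spans frankfurt, paris and further studios):

# projects developed in frankfurt AND paris, possibly among other studios as well
at_least_both = df[df.groupby('project_id')['studio']
                     .transform(lambda x: set(x) >= {'frankfurt', 'paris'})]
print(at_least_both)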

CodePudding user response:

I like to solve this with indices:

The first one:

# for each row, count how many rows in df share its project_id
# (this counts rows rather than distinct studios, so it assumes each studio appears at most once per project)
idxs_min_2_studios = df.apply(lambda x: (df.project_id == x.project_id).sum(), axis=1) >= 2
print(df[idxs_min_2_studios])

   project_id codename     studio
2       96648    peach  frankfurt
3       96648    peach      paris
5       68277     pear    brussel
6       68277     pear  amsterdam
7       68277     pear      sofia

The second one:

# for each row, check whether the required studios are a subset of that project's studios
reqd_studios = {'amsterdam', 'sofia'}
idx_reqd_studios = df.apply(
    lambda x: reqd_studios.issubset(set(df[df.project_id == x.project_id].studio)), axis=1)
print(df[idx_reqd_studios])

   project_id codename     studio
5       68277     pear    brussel
6       68277     pear  amsterdam
7       68277     pear      sofia
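To reproduce the frankfurt/paris result the question actually asked for, the same pattern works with the required set swapped (my addition, not part of the answer):

reqd_studios = {'frankfurt', 'paris'}
idx_reqd_studios = df.apply(
    lambda x: reqd_studios.issubset(set(df[df.project_id == x.project_id].studio)), axis=1)
print(df[idx_reqd_studios])

   project_id codename     studio
2       96648    peach  frankfurt
3       96648    peach      paris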