How to filter out multiple rows in a pandas.DataFrame based on multiple conditions for the same colu-CodePudding

I have an exemplary pd.DataFrame containing codenames of software developed in different development studios:

df = pd.DataFrame({'project_id': [36423, 28564, 96648, 96648, 10042, 68277, 68277, 68277], 'codename': ['banana', 'apple', 'peach', 'peach', 'melon', 'pear', 'pear', 'pear'], 'studio': ['paris', 'amsterdam', 'frankfurt', 'paris', 'london', 'brussel', 'amsterdam', 'sofia']})

      id codename     studio
0  36423   banana      paris
1  28564    apple  amsterdam
2  96648    peach  frankfurt
3  96648    peach      paris
4  10042    melon     london
5  68277     pear    brussel
6  68277     pear  amsterdam
7  68277     pear      sofia

What would be the best way to filter out these rows which hold projects developed

in at least two different studios?
in two specific studios?

The results I am trying to achieve look like as follows:

Which projects are getting developed in at least two different studios:

   project_id codename     studio
0       96648    peach  frankfurt
1       96648    peach      paris
2       68277     pear    brussel
3       68277     pear  amsterdam
4       68277     pear      sofia

Which projects are getting developed in frankfurt AND paris?

   project_id codename     studio
0       96648    peach  frankfurt
1       96648    peach      paris

Using df.loc[df['studio'].isin(['frankfurt', 'paris'])] for instance does not work, as this function filters out all rows which contain either frankfurt or paris in the column studio. Is there a more elegant way than filtering the dataframe for frankfurt and paris and using the Series.intersection() method? I am running out of Ideas here.

Thanks in advance! :)

CodePudding user response：

For the first question:

df[df.groupby('project_id')['studio'].transform('nunique').ge(2)]

output:

   project_id codename     studio
2       96648    peach  frankfurt
3       96648    peach      paris
5       68277     pear    brussel
6       68277     pear  amsterdam
7       68277     pear      sofia

For the second:

df[df.groupby('project_id')['studio']
     .transform(lambda x: set(x)=={'frankfurt', 'paris'})]
# if you want at least frankfurt paris, use
# set(x)>={'frankfurt', 'paris'})

output:

   project_id codename     studio
2       96648    peach  frankfurt
3       96648    peach      paris

CodePudding user response：

I like to solve this with indices:

The first one:

idxs_min_2_studios = df.apply(lambda x:  (df.project_id == x.project_id).sum() , axis=1) >= 2
print(df[idxs_min_2_studios])

   project_id codename     studio
2       96648    peach  frankfurt
3       96648    peach      paris
5       68277     pear    brussel
6       68277     pear  amsterdam
7       68277     pear      sofia

The second one:

reqd_studios = {'amsterdam', 'sofia'}
idx_reqd_studios = df.apply(lambda x:  reqd_studios.issubset(set(df[df.project_id == x.project_id].studio))    ,axis=1) 
print(df[idx_reqd_studios])

   project_id codename     studio
5       68277     pear    brussel
6       68277     pear  amsterdam
7       68277     pear      sofia