I have an exemplary pd.DataFrame containing codenames of software developed in different development studios:
df = pd.DataFrame({'project_id': [36423, 28564, 96648, 96648, 10042, 68277, 68277, 68277], 'codename': ['banana', 'apple', 'peach', 'peach', 'melon', 'pear', 'pear', 'pear'], 'studio': ['paris', 'amsterdam', 'frankfurt', 'paris', 'london', 'brussel', 'amsterdam', 'sofia']})
id codename studio
0 36423 banana paris
1 28564 apple amsterdam
2 96648 peach frankfurt
3 96648 peach paris
4 10042 melon london
5 68277 pear brussel
6 68277 pear amsterdam
7 68277 pear sofia
What would be the best way to filter out these rows which hold projects developed
- in at least two different studios?
- in two specific studios?
The results I am trying to achieve look like as follows:
Which projects are getting developed in at least two different studios:
project_id codename studio
0 96648 peach frankfurt
1 96648 peach paris
2 68277 pear brussel
3 68277 pear amsterdam
4 68277 pear sofia
Which projects are getting developed in frankfurt AND paris?
project_id codename studio
0 96648 peach frankfurt
1 96648 peach paris
Using df.loc[df['studio'].isin(['frankfurt', 'paris'])]
for instance does not work, as this function filters out all rows which contain either frankfurt
or paris
in the column studio
. Is there a more elegant way than filtering the dataframe for frankfurt
and paris
and using the Series.intersection()
method? I am running out of Ideas here.
Thanks in advance! :)
CodePudding user response:
For the first question:
df[df.groupby('project_id')['studio'].transform('nunique').ge(2)]
output:
project_id codename studio
2 96648 peach frankfurt
3 96648 peach paris
5 68277 pear brussel
6 68277 pear amsterdam
7 68277 pear sofia
For the second:
df[df.groupby('project_id')['studio']
.transform(lambda x: set(x)=={'frankfurt', 'paris'})]
# if you want at least frankfurt paris, use
# set(x)>={'frankfurt', 'paris'})
output:
project_id codename studio
2 96648 peach frankfurt
3 96648 peach paris
CodePudding user response:
I like to solve this with indices:
The first one:
idxs_min_2_studios = df.apply(lambda x: (df.project_id == x.project_id).sum() , axis=1) >= 2
print(df[idxs_min_2_studios])
project_id codename studio
2 96648 peach frankfurt
3 96648 peach paris
5 68277 pear brussel
6 68277 pear amsterdam
7 68277 pear sofia
The second one:
reqd_studios = {'amsterdam', 'sofia'}
idx_reqd_studios = df.apply(lambda x: reqd_studios.issubset(set(df[df.project_id == x.project_id].studio)) ,axis=1)
print(df[idx_reqd_studios])
project_id codename studio
5 68277 pear brussel
6 68277 pear amsterdam
7 68277 pear sofia