DataFrame selecting users that match a condition in all categories-CodePudding

I have the following DataFrame:

   user category  x  y
0    AB        A  1  1
1    EF        A  1  1
2    SG        A  1  0
3    MN        A  1  0
4    AB        B  0  0
5    EF        B  0  1
6    SG        B  0  1
7    MN        B  0  0
8    AB        C  1  1
9    EF        C  1  1
10   SG        C  1  1
11   MN        C  1  1

I want to select users that have x=y in all categories. I was able to do that using the following code:

data = pd.DataFrame({'user': ['AB', 'EF', 'SG', 'MN', 'AB', 'EF', 
                              'SG', 'MN', 'AB', 'EF', 'SG', 'MN'],
                     'category': ['A', 'A', 'A', 'A', 'B', 'B', 
                                  'B', 'B', 'C', 'C', 'C', 'C'],
                     'x': [1,1,1,1, 0,0,0,0, 1,1,1,1],
                     'y': [1,1,0,0, 0,1,1,0, 1,1,1,1]})

data = data[data['x'] == data['y']][['user', 'category']]
count_users_match = data.groupby('user', as_index=False).count()
count_cat = data['category'].unique().shape[0]
print(count_users_match[count_users_match['category'] == count_cat])

Output:

  user  category
0   AB         3

I felt that this is a quite long solution. Is there any shorter way to achieve this?

CodePudding user response：

Try this:

filtered = df.x.eq(df.y).groupby(df['user']).sum().loc[lambda x: x == df['category'].nunique()].reset_index(name='category')

Output:

>>> filtered
  user  category
0   AB         3

CodePudding user response：

We could use query groupby size to find the number of matching categories for each user. Then compare it with the number of categories for each user:

tmp = data.query('x==y').groupby('user').size()
out = tmp[tmp == data['category'].nunique()].reset_index(name='category')

Output:

  user  category
0   AB         3

CodePudding user response：

This is a more compact way to do it, but I don't know if it is also more efficient.

out = [{'user': user, 'frequency': data.loc[data['x'] == data['y']]['user'].value_counts()[user]} for user in data['user'].unique() if data.loc[data['x'] == data['y']]['user'].value_counts()[user] == data['user'].value_counts()[user]]
>>> out
[{'user': 'AB', 'frequency': 3}]