Adding boolean column to pandas dataframe where one row being true should make all same users rows t-CodePudding

I have problems with pandas dataframe when adding a boolean column. Data has users who have projects they can open in several places. I would need to have a group of users who have worked with the same project in several places. If the same user has opened the same project in different places even once it would make shared_projects true. Then all rows with that user_id should be true.

Here is an example df:

user_id   project_id_x   project_id_y
   1           1              2 
   1           3              4  
   2           5              6 
   2           7              7 
   2           8              9
   3           10             11                     
   3           12             10

This is a simple example what I would like to get out. If the condition is true in one line it will be true in all the users with that user_id.

user_id   project_id_x   project_id_y   shared_projects
   1           1              2           false
   1           3              4           false 
   2           5              6           true
   2           7              7           true
   2           8              9           true
   3           10             11          true           
   3           12             10          true

I can get boolean values based on each row but I am stuck how to make it true to all users if it is true on one row.

CodePudding user response：

First you will have to do a complex selection to find the user that have worked in the same project in different columns:

df['shared_projects'] = (df['project_id_x'] == df['project_id_y'])

That will create a new boolean column as you've already done. But then you can use the index of those True values to apply to the rest, assuming that "user_id" is your index for the dataframe.

for index in df[df['shared_projects'] == True]].index.unique():
    df.at[index, 'project_id_x'] = True
    df.at[index, 'project_id_y'] = True

CodePudding user response：

Assuming you want to match on the same row:

df['shared_projects'] = (df['project_id_x'].eq(df['project_id_y'])
                         .groupby(df['user_id']).transform('any')
                        )

If you want to match on any value x/y for a given user, you can use a set intersection:

s = df.groupby('user_id').apply(lambda g: bool(set(g['project_id_x'])
                                              .intersection(g['project_id_y'])))

df.merge(s.rename('shared_project'), left_on='user_id', right_index=True)

output:

   user_id  project_id_x  project_id_y  shared_projects
0        1             1             2            False
1        1             3             4            False
2        2             5             6             True
3        2             7             7             True
4        2             8             9             True

CodePudding user response：

Update

Another approach without apply, using value_counts.

user_id = df.melt('user_id', var_name='project', value_name='project_id') \
            .value_counts(['user_id', 'project_id']) \
            .loc[lambda x: x > 1].index.get_level_values('user_id')
df['shared_projects'] = df['user_id'].isin(user_id)

Output:

>>> df
user_id   project_id_x   project_id_y
   1           1              2 
   1           3              4  
   2           5              6 
   2           7              7 
   2           8              9

# Intermediate result
>>> df.melt('user_id', var_name='project', value_name='project_id') \
            .value_counts(['user_id', 'project_id'])

user_id  project_id
2        7             2  # <- project 7 in multiple places for user 2
1        1             1
         2             1
         3             1
         4             1
2        5             1
         6             1
         8             1
         9             1
dtype: int64

Old answer

You can use melt:

shared_projects = lambda x: len(set(x)) != len(x)
user_id = df.melt('user_id').groupby('user_id')['value'].apply(shared_projects)
df['shared_projects'] = df['user_id'].isin(user_id[user_id].index)

Output:

>>> df
   user_id  project_id_x  project_id_y  shared_projects
0        1             1             2            False
1        1             3             4            False
2        2             5             6             True
3        2             7             7             True
4        2             8             9             True