I have problems with pandas dataframe when adding a boolean column. Data has users who have projects they can open in several places. I would need to have a group of users who have worked with the same project in several places. If the same user has opened the same project in different places even once it would make shared_projects true. Then all rows with that user_id should be true.
Here is an example df:
user_id project_id_x project_id_y
1 1 2
1 3 4
2 5 6
2 7 7
2 8 9
3 10 11
3 12 10
This is a simple example what I would like to get out. If the condition is true in one line it will be true in all the users with that user_id.
user_id project_id_x project_id_y shared_projects
1 1 2 false
1 3 4 false
2 5 6 true
2 7 7 true
2 8 9 true
3 10 11 true
3 12 10 true
I can get boolean values based on each row but I am stuck how to make it true to all users if it is true on one row.
CodePudding user response:
First you will have to do a complex selection to find the user that have worked in the same project in different columns:
df['shared_projects'] = (df['project_id_x'] == df['project_id_y'])
That will create a new boolean column as you've already done. But then you can use the index of those True values to apply to the rest, assuming that "user_id" is your index for the dataframe.
for index in df[df['shared_projects'] == True]].index.unique():
df.at[index, 'project_id_x'] = True
df.at[index, 'project_id_y'] = True
CodePudding user response:
Assuming you want to match on the same row:
df['shared_projects'] = (df['project_id_x'].eq(df['project_id_y'])
.groupby(df['user_id']).transform('any')
)
If you want to match on any value x/y for a given user, you can use a set
intersection:
s = df.groupby('user_id').apply(lambda g: bool(set(g['project_id_x'])
.intersection(g['project_id_y'])))
df.merge(s.rename('shared_project'), left_on='user_id', right_index=True)
output:
user_id project_id_x project_id_y shared_projects
0 1 1 2 False
1 1 3 4 False
2 2 5 6 True
3 2 7 7 True
4 2 8 9 True
CodePudding user response:
Update
Another approach without apply
, using value_counts
.
user_id = df.melt('user_id', var_name='project', value_name='project_id') \
.value_counts(['user_id', 'project_id']) \
.loc[lambda x: x > 1].index.get_level_values('user_id')
df['shared_projects'] = df['user_id'].isin(user_id)
Output:
>>> df
user_id project_id_x project_id_y
1 1 2
1 3 4
2 5 6
2 7 7
2 8 9
# Intermediate result
>>> df.melt('user_id', var_name='project', value_name='project_id') \
.value_counts(['user_id', 'project_id'])
user_id project_id
2 7 2 # <- project 7 in multiple places for user 2
1 1 1
2 1
3 1
4 1
2 5 1
6 1
8 1
9 1
dtype: int64
Old answer
You can use melt
:
shared_projects = lambda x: len(set(x)) != len(x)
user_id = df.melt('user_id').groupby('user_id')['value'].apply(shared_projects)
df['shared_projects'] = df['user_id'].isin(user_id[user_id].index)
Output:
>>> df
user_id project_id_x project_id_y shared_projects
0 1 1 2 False
1 1 3 4 False
2 2 5 6 True
3 2 7 7 True
4 2 8 9 True