I have two pandas DataFrames. The first one, df1, contains a column of file paths and a column of lists of the users that have read access to each path. The second DataFrame, df2, contains all possible users. I've created an example below:
import pandas as pd

df1 = pd.DataFrame()
df1['path'] = ['C:/pathA', 'C:/pathB', 'C:/pathC', 'C:/pathD']
df1['read'] = [['userA', 'userC', 'userD'],
               ['userA', 'userB'],
               ['userB', 'userD'],
               ['userA', 'userB', 'userC', 'userD']]
print(df1)
path read
0 C:/pathA [userA, userC, userD]
1 C:/pathB [userA, userB]
2 C:/pathC [userB, userD]
3 C:/pathD [userA, userB, userC, userD]
df2 = pd.DataFrame(data=['userA', 'userB', 'userC', 'userD'], columns=['user'])
print(df2)
user
0 userA
1 userB
2 userC
3 userD
The end goal is to create a new column df2['read_count'], which should take each user string from df2['user'] and count the total number of matches in the column df1['read']. The expected output is exactly that: a count of matches for each user string in the column of lists in df1['read']. Here is what I am expecting based on the example:
df2
user read_count
0 userA 3
1 userB 3
2 userC 2
3 userD 3
I tried putting something together based on another question, using a list comprehension, but no luck. Here is what I currently have:
df2['read_count'] = [sum(all(val in cell for val in row)
for cell in df1['read'])
for row in df2['user']]
print(df2)
user read_count
0 userA 0
1 userB 0
2 userC 0
3 userD 0
What is wrong with the code I currently have? I've tried actually stepping through the loops and everything seemed right, but my code can't detect the matches I want.
CodePudding user response:
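First, what's wrong with your attempt: iterating over a string yields its individual characters, so all(val in cell for val in row) tests whether each character of the user name ('u', 's', 'e', 'r', ...) is an element of the list, which is never true, hence all the zeros. A minimal fix that keeps your list-comprehension approach could be:

# test membership of the whole user string in each list, not of its characters
df2['read_count'] = [sum(user in cell for cell in df1['read'])
                     for user in df2['user']]

A more pandas-native approach avoids the Python-level loop entirely.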
You can use:
df2.merge(df1['read'].explode().value_counts(),
left_on='user', right_index=True)
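Note that the counted column's name depends on your pandas version (value_counts keeps the Series name read in older versions and returns a Series named count from pandas 2.0 on). If you want the read_count column from the question, one option is to rename the counts before merging:

# rename the counts Series so the merged column is called 'read_count'
df2.merge(df1['read'].explode().value_counts().rename('read_count'),
          left_on='user', right_index=True)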
Or, if you really need to use a different kind of aggregation that depends on "path":
df2.merge(df1.explode('read').groupby('read')['path'].count(),
left_on='user', right_index=True)
output:
user path
0 userA 3
1 userB 3
2 userC 2
3 userD 3
Without df2:
df1['read'].explode().value_counts().reset_index()
# or
# df1.explode('read').groupby('read', as_index=False)['path'].count()
output:
index read
0 userA 3
1 userB 3
2 userC 2
3 userD 3
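Here too the column names depend on the pandas version and on which of the two variants you run. To get the exact user/read_count headers from the question regardless of version, you could set them explicitly (a small sketch; 'user' and 'read_count' are simply the labels the question asks for):

# name the index and the counts column explicitly
df1['read'].explode().value_counts().rename_axis('user').reset_index(name='read_count')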
CodePudding user response:
Let's use np.hstack to flatten the lists, count the occurrences of each user with Counter, and then map the counts back to df2:

import numpy as np
from collections import Counter

df2['count'] = df2['user'].map(Counter(np.hstack(df1.read)))

Result:
user count
0 userA 3
1 userB 3
2 userC 2
3 userD 3
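If you'd rather not pull in numpy just to flatten the lists, itertools.chain does the same job on plain Python lists (a small variant of the same idea):

from collections import Counter
from itertools import chain

# flatten the lists of users without numpy, then count and map as before
df2['count'] = df2['user'].map(Counter(chain.from_iterable(df1['read'])))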