I am given a (pandas) dataframe telling me about membership relations of people and clubs. What I want to find is the number of members that any two clubs have in common.
Example Input:
Person Club
1 A
1 B
1 C
2 A
2 C
3 A
3 B
4 C
In other words, A = {1,2,3}, B = {1,3}, and C = {1,2,4}.
Desired output:
Club 1 Club 2 Num_Overlaps
A B 1
A C 2
B C 1
I can of course write python code that calculates those numbers, but I guess there must be a more dataframe-ish way using groupby or so to accomplish the same.
CodePudding user response:
First, I grouped the dataframe on the club to get a set of each person in the club.
grouped = df.groupby("Club").agg({"Person": set}).reset_index()
Club Person
0 A {1, 2, 3}
1 B {1, 3}
2 C {1, 2, 4}
Then, I created a Cartesian product of this dataframe. I didn't have pandas 1.2.0, so I couldn't use the cross join available in df.merge()
. Instead, I used the idea from this answer: pandas two dataframe cross join
grouped["key"] = 0
product = grouped.merge(grouped, on="key", how="outer").drop(columns="key")
Club_x Person_x Club_y Person_y
0 A {1, 2, 3} A {1, 2, 3}
1 A {1, 2, 3} B {1, 3}
2 A {1, 2, 3} C {1, 2, 4}
3 B {1, 3} A {1, 2, 3}
4 B {1, 3} B {1, 3}
5 B {1, 3} C {1, 2, 4}
6 C {1, 2, 4} A {1, 2, 3}
7 C {1, 2, 4} B {1, 3}
8 C {1, 2, 4} C {1, 2, 4}
I then filtered out pairs where Club_x < Club_y
so it removes duplicate pairs.
filtered = product[product["Club_x"] < product["Club_y"]]
Club_x Person_x Club_y Person_y
1 A {1, 2, 3} B {1, 3}
2 A {1, 2, 3} C {1, 2, 4}
5 B {1, 3} C {1, 2, 4}
Finally, I added the column with the overlap size and renamed columns as necessary.
result = filtered.assign(Num_Overlaps=filtered.apply(lambda row: len(row["Person_x"].intersection(row["Person_y"])), axis=1))
result = result.rename(columns={"Club_x": "Club 1", "Club_y": "Club 2"}).drop(["Person_x", "Person_y"], axis=1)
Club 1 Club 2 Num_Overlaps
1 A B 2
2 A C 2
5 B C 1
CodePudding user response:
You can indeed do this with groupby
and some set manipulation. I would also use itertools.combinations
, to get the list of club pairs.
import pandas as pd
from itertools import combinations
df = pd.DataFrame({'Person': [1, 1, 1, 2, 2, 3, 3, 4],
'Club': list('ABCACABC')})
members = df.groupby('Club').agg(set)
clubs = sorted(list(set(df.Club)))
overlap = pd.DataFrame(list(combinations(clubs, 2)),
columns=['Club 1', 'Club 2'])
def n_overlap(row):
club1, club2 = row
members1 = members.loc[club1, 'Person']
members2 = members.loc[club2, 'Person']
return len(members1.intersection(members2))
overlap['Num_Overlaps'] = overlap.apply(n_overlap, axis=1)
overlap
Club 1 Club 2 Num_Overlaps
0 A B 2
1 A C 2
2 B C 1
Note there is one difference to your desired output, but that is probably as it should be, as noted by @rchome in the comment above.