Home > Blockchain >  pandas: Find overlap of clubs
pandas: Find overlap of clubs

Time:12-17

I am given a (pandas) dataframe telling me about membership relations of people and clubs. What I want to find is the number of members that any two clubs have in common.

Example Input:

Person Club
1      A
1      B
1      C
2      A
2      C
3      A
3      B
4      C

In other words, A = {1,2,3}, B = {1,3}, and C = {1,2,4}.

Desired output:

Club 1  Club 2  Num_Overlaps
A       B       1
A       C       2
B       C       1

I can of course write python code that calculates those numbers, but I guess there must be a more dataframe-ish way using groupby or so to accomplish the same.

CodePudding user response:

First, I grouped the dataframe on the club to get a set of each person in the club.

grouped = df.groupby("Club").agg({"Person": set}).reset_index()

  Club     Person
0    A  {1, 2, 3}
1    B     {1, 3}
2    C  {1, 2, 4}

Then, I created a Cartesian product of this dataframe. I didn't have pandas 1.2.0, so I couldn't use the cross join available in df.merge(). Instead, I used the idea from this answer: pandas two dataframe cross join

grouped["key"] = 0
product = grouped.merge(grouped, on="key", how="outer").drop(columns="key")

  Club_x   Person_x Club_y   Person_y
0      A  {1, 2, 3}      A  {1, 2, 3}
1      A  {1, 2, 3}      B     {1, 3}
2      A  {1, 2, 3}      C  {1, 2, 4}
3      B     {1, 3}      A  {1, 2, 3}
4      B     {1, 3}      B     {1, 3}
5      B     {1, 3}      C  {1, 2, 4}
6      C  {1, 2, 4}      A  {1, 2, 3}
7      C  {1, 2, 4}      B     {1, 3}
8      C  {1, 2, 4}      C  {1, 2, 4}

I then filtered out pairs where Club_x < Club_y so it removes duplicate pairs.

filtered = product[product["Club_x"] < product["Club_y"]]

  Club_x   Person_x Club_y   Person_y
1      A  {1, 2, 3}      B     {1, 3}
2      A  {1, 2, 3}      C  {1, 2, 4}
5      B     {1, 3}      C  {1, 2, 4}

Finally, I added the column with the overlap size and renamed columns as necessary.

result = filtered.assign(Num_Overlaps=filtered.apply(lambda row: len(row["Person_x"].intersection(row["Person_y"])), axis=1))
result = result.rename(columns={"Club_x": "Club 1", "Club_y": "Club 2"}).drop(["Person_x", "Person_y"], axis=1)

  Club 1 Club 2  Num_Overlaps
1      A      B             2
2      A      C             2
5      B      C             1

CodePudding user response:

You can indeed do this with groupby and some set manipulation. I would also use itertools.combinations, to get the list of club pairs.

import pandas as pd
from itertools import combinations

df = pd.DataFrame({'Person': [1, 1, 1, 2, 2, 3, 3, 4],
                   'Club': list('ABCACABC')})

members = df.groupby('Club').agg(set)
clubs = sorted(list(set(df.Club)))
overlap = pd.DataFrame(list(combinations(clubs, 2)),
                       columns=['Club 1', 'Club 2'])

def n_overlap(row):
    club1, club2 = row
    members1 = members.loc[club1, 'Person']
    members2 = members.loc[club2, 'Person']
    return len(members1.intersection(members2))

overlap['Num_Overlaps'] = overlap.apply(n_overlap, axis=1) 
overlap
    Club 1  Club 2  Num_Overlaps
0   A       B       2
1   A       C       2
2   B       C       1

Note there is one difference to your desired output, but that is probably as it should be, as noted by @rchome in the comment above.

  • Related