Home > other >  Edgelist Pandas: Get Number of Common Values between Two Combos in a Column
Edgelist Pandas: Get Number of Common Values between Two Combos in a Column

Time:10-22

I have a table like:

 ─────── ──────── 
| Name  | Pet    |
 ─────── ──────── 
| Alice | Dog    |
| Bob   | Dog    |
| Clyde | Horse  |
| Dave  | Dog    |
| Alice | Cat    |
| Bob   | Cat    |
 ─────── ──────── 

And I want to count the number of pets that (Person1, Person2) have in common, so in this table that would be something like:

 ──────── ──────── ─────────────────────────────────── 
| Name1  | Name2  | Pet                               |
 ──────── ──────── ─────────────────────────────────── 
| Alice  | Bob    | 2 (because both have Dogs, Cats)  |
| Alice  | Clyde  | 0                                 |
| Alice  | Dave   | 1                                 |
| Bob    | Clyde  | 0                                 |
| Bob    | Dave   | 1                                 |
| Clyde  | Dave   | 0                                 |
 ──────── ──────── ─────────────────────────────────── 

CodePudding user response:

Working within Python space might help with this :

from itertools import combinations
temp = pd.DataFrame(combinations(df.Name.unique(), 2), 
                    columns=['Name1', 'Name2'])

 mapp = df.groupby('Name').Pet.agg(set)
tempo = temp.assign(Name1 = temp.Name1.map(mapp), 
                    Name2 = temp.Name2.map(mapp))

Pet_count = [len(a.intersection(b)) 
             for a, b in zip(tempo.Name1, tempo.Name2)]

temp.assign(Pet = Pet_count)

   Name1  Name2  Pet
0  Alice    Bob    2
1  Alice  Clyde    0
2  Alice   Dave    1
3    Bob  Clyde    0
4    Bob   Dave    1
5  Clyde   Dave    0

CodePudding user response:

setup

df = pd.DataFrame(
    {
        "Name":["Alice", "Bob", "Clyde", "Dave", "Alice", "Bob"],
        "Pet":["Dog", "Dog", "Horse", "Dog", "Cat", "Cat"],
    }
)

We can get a table (dataframe) indexed by each person, which gives us truthy values for whether they own a particular pet like so

has_pet = pd.get_dummies(df["Pet"]).groupby(df["Name"]).sum()

has_pet will look like this

       Cat  Dog  Horse
Alice    1    1      0
Bob      1    1      0
Clyde    0    0      1
Dave     0    1      0

Pick two people and the number of pets in common is equal to the dot product between their rows. We can calculate this with one operation using Dataframe.dot

common_matrix = has_pet.dot(has_pet.transpose())

common_matrix looks like this

       Alice  Bob  Clyde  Dave
Alice      2    2      0     1
Bob        2    2      0     1
Clyde      0    0      1     0
Dave       1    1      0     1

This has the data you need. In order to get it in the format you want we can do the following.

(
    common_matrix
    .melt(ignore_index=False)  # turns the column headers to a single column of names
    .rename(columns={"Name":"Name2"})  # rename the column to avoid a name clash with index, which is also called Name
    .reset_index()
    .query("Name < Name2")  # used standard alphabetical order to remove repeated values
)

The result is

     Name  Name2  value
4   Alice    Bob      2
8   Alice  Clyde      0
9     Bob  Clyde      0
12  Alice   Dave      1
13    Bob   Dave      1
14  Clyde   Dave      0

You can sort the Name column to get the desired answer you gave (if it's necessary)

  • Related