I have a table like:
─────── ────────
| Name | Pet |
─────── ────────
| Alice | Dog |
| Bob | Dog |
| Clyde | Horse |
| Dave | Dog |
| Alice | Cat |
| Bob | Cat |
─────── ────────
And I want to count the number of pets that (Person1, Person2) have in common, so in this table that would be something like:
──────── ──────── ───────────────────────────────────
| Name1 | Name2 | Pet |
──────── ──────── ───────────────────────────────────
| Alice | Bob | 2 (because both have Dogs, Cats) |
| Alice | Clyde | 0 |
| Alice | Dave | 1 |
| Bob | Clyde | 0 |
| Bob | Dave | 1 |
| Clyde | Dave | 0 |
──────── ──────── ───────────────────────────────────
CodePudding user response:
Working within Python space might help with this :
from itertools import combinations
temp = pd.DataFrame(combinations(df.Name.unique(), 2),
columns=['Name1', 'Name2'])
mapp = df.groupby('Name').Pet.agg(set)
tempo = temp.assign(Name1 = temp.Name1.map(mapp),
Name2 = temp.Name2.map(mapp))
Pet_count = [len(a.intersection(b))
for a, b in zip(tempo.Name1, tempo.Name2)]
temp.assign(Pet = Pet_count)
Name1 Name2 Pet
0 Alice Bob 2
1 Alice Clyde 0
2 Alice Dave 1
3 Bob Clyde 0
4 Bob Dave 1
5 Clyde Dave 0
CodePudding user response:
setup
df = pd.DataFrame(
{
"Name":["Alice", "Bob", "Clyde", "Dave", "Alice", "Bob"],
"Pet":["Dog", "Dog", "Horse", "Dog", "Cat", "Cat"],
}
)
We can get a table (dataframe) indexed by each person, which gives us truthy values for whether they own a particular pet like so
has_pet = pd.get_dummies(df["Pet"]).groupby(df["Name"]).sum()
has_pet
will look like this
Cat Dog Horse
Alice 1 1 0
Bob 1 1 0
Clyde 0 0 1
Dave 0 1 0
Pick two people and the number of pets in common is equal to the dot product between their rows. We can calculate this with one operation using Dataframe.dot
common_matrix = has_pet.dot(has_pet.transpose())
common_matrix
looks like this
Alice Bob Clyde Dave
Alice 2 2 0 1
Bob 2 2 0 1
Clyde 0 0 1 0
Dave 1 1 0 1
This has the data you need. In order to get it in the format you want we can do the following.
(
common_matrix
.melt(ignore_index=False) # turns the column headers to a single column of names
.rename(columns={"Name":"Name2"}) # rename the column to avoid a name clash with index, which is also called Name
.reset_index()
.query("Name < Name2") # used standard alphabetical order to remove repeated values
)
The result is
Name Name2 value
4 Alice Bob 2
8 Alice Clyde 0
9 Bob Clyde 0
12 Alice Dave 1
13 Bob Dave 1
14 Clyde Dave 0
You can sort the Name
column to get the desired answer you gave (if it's necessary)