Consider a pandas DataFrame with 2 columns: image_id
and name
- Each row represents one person (name) located in an image (image_id)
- Each image can have 1 or more people
- Each name can appear at most once in an image
- Friendship order does not matter, e.g. Bob & Mary = Mary & Bob
How can I count how many times two people occur in the same image across the entire dataset?
data = [[1, 'Mary'], [1, 'Bob'], [1, 'Susan'],
[2, 'Bob'], [2, 'Joe'],
[3, 'Isaac'],
[4, 'Mary'], [4, 'Susan'],
[5, 'Mary'], [5, 'Susan'], [5, 'Bob'], [5, 'Joe']]
df = pd.DataFrame(data, columns=['image_id', 'name'])
# Now what?
Expected dataframe (order of rows or names doesn't matter):
name1 name2 count
Mary Susan 3
Bob Susan 2
Mary Bob 2
Bob Joe 2
Mary Joe 1
Susan Joe 1
Alternative solution:
It would also be acceptable to have a symmetric grid where rows and columns are all names, and the cell value is the number of times those two people have appeared in the same image. Whatever is easier.
CodePudding user response:
We can use crosstab
to calculate frequency table then calculate the inner product on this frequency table to count the number of times two people occur in same image
s = pd.crosstab(df['image_id'], df['name'])
c = s.T @ s
c = c.mask(np.triu(c, 1) == 0).stack()\
.rename_axis(['name1', 'name2']).reset_index(name='count')
name1 name2 count
0 Bob Joe 2.0
1 Bob Mary 2.0
2 Bob Susan 2.0
3 Joe Mary 1.0
4 Joe Susan 1.0
5 Mary Susan 3.0
EDIT by OP:
Here's a detailed explanation of the above code:
# Compute a frequency table of names that appear in each image.
s = pd.crosstab(df['image_id'], df['name'])
name Bob Isaac Joe Mary Susan
image_id
1 1 0 0 1 1
2 1 0 1 0 0
3 0 1 0 0 0
4 0 0 0 1 1
5 1 0 1 1 1
# Inner product counts the occurrences of each pair.
# The diagonal counts the number of times a name appeared in any image.
c = s.T @ s
name Bob Isaac Joe Mary Susan
name
Bob 3 0 2 2 2
Isaac 0 1 0 0 0
Joe 2 0 2 1 1
Mary 2 0 1 3 3
Susan 2 0 1 3 3
# Keep the non-zero elements in the upper triangle, since matrix is symmetric.
c = c.mask(np.triu(c, 1) == 0)
name Bob Isaac Joe Mary Susan
name
Bob NaN NaN 2.0 2.0 2.0
Isaac NaN NaN NaN NaN NaN
Joe NaN NaN NaN 1.0 1.0
Mary NaN NaN NaN NaN 3.0
Susan NaN NaN NaN NaN NaN
# Group all counts in a single column.
# Each row represents a unique pair of names.
c = c.stack()
name name
Bob Joe 2.0
Mary 2.0
Susan 2.0
Joe Mary 1.0
Susan 1.0
Mary Susan 3.0
# Expand the MultiIndex into separate columns.
c = c.rename_axis(['name1', 'name2']).reset_index(name='count')
name1 name2 count
0 Bob Joe 2.0
1 Bob Mary 2.0
2 Bob Susan 2.0
3 Joe Mary 1.0
4 Joe Susan 1.0
5 Mary Susan 3.0
See crosstab, @ (matrix mult.), T (transpose), triu, mask and stack for more details.
CodePudding user response:
I know the answer is already made and accepted by the user. But still, I want to share my code. This is my code to achieve the expected output by the "HARD WAY".
import itertools
import pandas as pd
data = [[1, 'Mary'], [1, 'Bob'], [1, 'Susan'],
[2, 'Bob'], [2, 'Joe'],
[3, 'Isaac'],
[4, 'Mary'], [4, 'Susan'],
[5, 'Mary'], [5, 'Susan'], [5, 'Bob'], [5, 'Joe']]
df = pd.DataFrame(data, columns=['image_id', 'name'])
# Group the df by 'image_id' and get the value of name in the form of list
groups = df.groupby(['image_id'])['name'].apply(list).reset_index()
output = {}
# Loop through the groups dataframe
for index, row in groups.iterrows():
# Sort the list of names in ascending order
row['name'].sort()
# Get the all possible combination of list in pair of twos
temp = list(itertools.combinations(row['name'], 2))
# Loop through it and maintain the output dictionary with its occurrence
# Default set occurrence value to 1 when initialize
# Increment it when we found more occurrence of it
for i, val in enumerate(temp):
if val not in output:
output[val] = 1
else:
output[val] = 1
temp_output = []
# Reformat the output dictionary so we can initialize it into pandas dictionary
for key, val in output.items():
temp = [key[0], key[1], val]
temp_output.append(temp)
df = pd.DataFrame(temp_output, columns=['name1', 'name2', 'count'])
print(df.sort_values(by=['count'], ascending=False))
And this is the output I am getting:
name1 name2 count
2 Mary Susan 3
0 Bob Mary 2
1 Bob Susan 2
3 Bob Joe 2
4 Joe Mary 1
5 Joe Susan 1
This is "NOT THE PYTHONIC" way, but this is how I solve most of my problems, which is not that good but it does my job.
NOTE: How code works is already mentioned in the comments but still if anyone of you has any doubts/questions/suggestions, then kindly let me know.