Friend Finder: Count how many times two unique values occur in the same group without order-CodePudding

Consider a pandas DataFrame with 2 columns: image_id and name

Each row represents one person (name) located in an image (image_id)
Each image can have 1 or more people
Each name can appear at most once in an image
Friendship order does not matter, e.g. Bob & Mary = Mary & Bob

How can I count how many times two people occur in the same image across the entire dataset?

data = [[1, 'Mary'], [1, 'Bob'], [1, 'Susan'],
        [2, 'Bob'], [2, 'Joe'],
        [3, 'Isaac'],
        [4, 'Mary'], [4, 'Susan'],
        [5, 'Mary'], [5, 'Susan'], [5, 'Bob'], [5, 'Joe']]

df = pd.DataFrame(data, columns=['image_id', 'name'])

# Now what?

Expected dataframe (order of rows or names doesn't matter):

 name1    name2   count
  Mary    Susan       3
   Bob    Susan       2
  Mary      Bob       2
   Bob      Joe       2
  Mary      Joe       1
 Susan      Joe       1

Alternative solution:

It would also be acceptable to have a symmetric grid where rows and columns are all names, and the cell value is the number of times those two people have appeared in the same image. Whatever is easier.

CodePudding user response：

We can use crosstab to calculate frequency table then calculate the inner product on this frequency table to count the number of times two people occur in same image

s = pd.crosstab(df['image_id'], df['name'])
c = s.T @ s
c = c.mask(np.triu(c, 1) == 0).stack()\
     .rename_axis(['name1', 'name2']).reset_index(name='count')

  name1  name2  count
0   Bob    Joe    2.0
1   Bob   Mary    2.0
2   Bob  Susan    2.0
3   Joe   Mary    1.0
4   Joe  Susan    1.0
5  Mary  Susan    3.0

EDIT by OP:

Here's a detailed explanation of the above code:

# Compute a frequency table of names that appear in each image.
s = pd.crosstab(df['image_id'], df['name'])

name      Bob  Isaac  Joe  Mary  Susan
image_id                              
1           1      0    0     1      1
2           1      0    1     0      0
3           0      1    0     0      0
4           0      0    0     1      1
5           1      0    1     1      1

# Inner product counts the occurrences of each pair.
# The diagonal counts the number of times a name appeared in any image.
c = s.T @ s

name   Bob  Isaac  Joe  Mary  Susan
name                               
Bob      3      0    2     2      2
Isaac    0      1    0     0      0
Joe      2      0    2     1      1
Mary     2      0    1     3      3
Susan    2      0    1     3      3

# Keep the non-zero elements in the upper triangle, since matrix is symmetric.
c = c.mask(np.triu(c, 1) == 0)

name   Bob  Isaac  Joe  Mary  Susan
name                               
Bob    NaN    NaN  2.0   2.0    2.0
Isaac  NaN    NaN  NaN   NaN    NaN
Joe    NaN    NaN  NaN   1.0    1.0
Mary   NaN    NaN  NaN   NaN    3.0
Susan  NaN    NaN  NaN   NaN    NaN

# Group all counts in a single column.
# Each row represents a unique pair of names.
c = c.stack()

name  name 
Bob   Joe      2.0
      Mary     2.0
      Susan    2.0
Joe   Mary     1.0
      Susan    1.0
Mary  Susan    3.0

# Expand the MultiIndex into separate columns.
c = c.rename_axis(['name1', 'name2']).reset_index(name='count')

  name1  name2  count
0   Bob    Joe    2.0
1   Bob   Mary    2.0
2   Bob  Susan    2.0
3   Joe   Mary    1.0
4   Joe  Susan    1.0
5  Mary  Susan    3.0

See crosstab, @ (matrix mult.), T (transpose), triu, mask and stack for more details.

CodePudding user response：

I know the answer is already made and accepted by the user. But still, I want to share my code. This is my code to achieve the expected output by the "HARD WAY".

import itertools
import pandas as pd

data = [[1, 'Mary'], [1, 'Bob'], [1, 'Susan'],
        [2, 'Bob'], [2, 'Joe'],
        [3, 'Isaac'],
        [4, 'Mary'], [4, 'Susan'],
        [5, 'Mary'], [5, 'Susan'], [5, 'Bob'], [5, 'Joe']]

df = pd.DataFrame(data, columns=['image_id', 'name'])

# Group the df by 'image_id' and get the value of name in the form of list
groups = df.groupby(['image_id'])['name'].apply(list).reset_index()

output = {}

# Loop through the groups dataframe
for index, row in groups.iterrows():
    # Sort the list of names in ascending order
    row['name'].sort()
    # Get the all possible combination of list in pair of twos
    temp = list(itertools.combinations(row['name'], 2))
    # Loop through it and maintain the output dictionary with its occurrence
    # Default set occurrence value to 1 when initialize
    # Increment it when we found more occurrence of it
    for i, val in enumerate(temp):
        if val not in output:
            output[val] = 1
        else:
            output[val]  = 1

temp_output = []

# Reformat the output dictionary so we can initialize it into pandas dictionary
for key, val in output.items():
    temp = [key[0], key[1], val]
    temp_output.append(temp)

df = pd.DataFrame(temp_output, columns=['name1', 'name2', 'count'])

print(df.sort_values(by=['count'], ascending=False))

And this is the output I am getting:

  name1  name2  count
2  Mary  Susan      3
0   Bob   Mary      2
1   Bob  Susan      2
3   Bob    Joe      2
4   Joe   Mary      1
5   Joe  Susan      1

This is "NOT THE PYTHONIC" way, but this is how I solve most of my problems, which is not that good but it does my job.

NOTE: How code works is already mentioned in the comments but still if anyone of you has any doubts/questions/suggestions, then kindly let me know.