Comparing pairs of unique column values in pandas-CodePudding

I have the following dataframe:

name   group  feedback     question
a      g1     False        abc
a      g1     True         abc
a      g1     True         xyz
b      g1     True         xyz
b      g1     True         abc
c      g1     False        def
d      g2     False        xyz
d      g2     True         xyz
e      g2     True         xyz

I am trying to do a statistic that contains the number of questions answered by each group (it doesn't matter how many times), the total number of questions, the number of questions answered by both groups and the ratio between mutual questions and total questions:

Team1 Team2 NumberQuestionsT1 NumberQuestionsT2 AllQuestions CommonQuestions  Ratio 
g1    g2                    3                 1            4               1   0.25

What I did was actually counting unique values "manually" with groupby, nunique etc, because it's simple when the group column has only two distinct values. However, is it possible to compare pairs of two groups and build dataframes with one row like above if my dataframe has many groups? Example:

name   group  feedback     question
a      g1     False        abc
a      g1     True         abc
a      g1     True         xyz
b      g1     True         xyz
b      g1     True         abc
c      g1     False        def
d      g2     False        xyz
d      g2     True         xyz
e      g2     True         xyz
f      g3     True         abc
f      g3     True         www
g      g3     False        xyz
h      g4     True         www
h      g4     True         qqq
i      g4     False        xyz

So I would want to compare g1 with g2, g1 with g3, g1 with g4, g2 with g3 and g3 with g4.

Team1 Team2 NumberQuestionsT1 NumberQuestionsT2 AllQuestions CommonQuestions  Ratio 
g1    g2                    3                 1            4               1   0.25 
g1    g3                    3                 3            4       2           0.50
g1    g4                    3                 3            4               1   0.25 
g2    g3                    1                 3            4       1           0.25
g2    g4                    1                 3            4       1           0.25
g3    g4                    3                 3            4               2   0.50

CodePudding user response：

Here's an attempt. First, find the set of questions answered for each group:

>>> g = df.groupby('group')['question'].apply(set)
>>> g
group
g1    {xyz, abc, def}
g2              {xyz}
g3    {xyz, abc, www}
g4    {qqq, xyz, www}
Name: question, dtype: object

Then, use itertools to get the combinations you want, computing the indices for each desired column:

import itertools

g = df.groupby('group')['question'].apply(set) # as above
rows = []
total = len(df.question.unique())

for a, b in itertools.combinations(g.index, 2):
    rows.append({'Team1':a,
                 'Team2':b,
                 'NumberQuestionsT1':len(g[a]),
                 'NumberQuestionsT2':len(g[b]),
                 'AllQuestions': total,
                 'CommonQuestions': len(g[a] & g[b]),
                 'Ratio': len(g[a] & g[b]) / total})
output = pd.DataFrame(rows)

Giving:

  Team1 Team2  NumberQuestionsT1  NumberQuestionsT2  AllQuestions  \
0    g1    g2                  3                  1             5   
1    g1    g3                  3                  3             5   
2    g1    g4                  3                  3             5   
3    g2    g3                  1                  3             5   
4    g2    g4                  1                  3             5   
5    g3    g4                  3                  3             5   

   CommonQuestions  Ratio  
0                1    0.2  
1                2    0.4  
2                1    0.2  
3                1    0.2  
4                1    0.2  
5                2    0.4

I've replace the AllQuestions calculation with the one you commented. Now, it seems like the difference is that your example input has 5 question types (abc, def, qqq, www, xyz)?

I couldn't think of a way to do this without itertools, and I am also using a for loop for the initial determination of every set of questions. So I reckon there could be a significantly faster method, but this may be sufficient given the size of your data.