I have the following dataframe:
name group feedback question
a g1 False abc
a g1 True abc
a g1 True xyz
b g1 True xyz
b g1 True abc
c g1 False def
d g2 False xyz
d g2 True xyz
e g2 True xyz
I am trying to do a statistic that contains the number of questions answered by each group (it doesn't matter how many times), the total number of questions, the number of questions answered by both groups and the ratio between mutual questions and total questions:
Team1 Team2 NumberQuestionsT1 NumberQuestionsT2 AllQuestions CommonQuestions Ratio
g1 g2 3 1 4 1 0.25
What I did was actually counting unique values "manually" with groupby
, nunique
etc, because it's simple when the group
column has only two distinct values. However, is it possible to compare pairs of two groups and build dataframes with one row like above if my dataframe has many group
s? Example:
name group feedback question
a g1 False abc
a g1 True abc
a g1 True xyz
b g1 True xyz
b g1 True abc
c g1 False def
d g2 False xyz
d g2 True xyz
e g2 True xyz
f g3 True abc
f g3 True www
g g3 False xyz
h g4 True www
h g4 True qqq
i g4 False xyz
So I would want to compare g1
with g2
, g1
with g3
, g1
with g4
, g2
with g3
and g3
with g4
.
Team1 Team2 NumberQuestionsT1 NumberQuestionsT2 AllQuestions CommonQuestions Ratio
g1 g2 3 1 4 1 0.25
g1 g3 3 3 4 2 0.50
g1 g4 3 3 4 1 0.25
g2 g3 1 3 4 1 0.25
g2 g4 1 3 4 1 0.25
g3 g4 3 3 4 2 0.50
CodePudding user response:
Here's an attempt. First, find the set of questions answered for each group:
>>> g = df.groupby('group')['question'].apply(set)
>>> g
group
g1 {xyz, abc, def}
g2 {xyz}
g3 {xyz, abc, www}
g4 {qqq, xyz, www}
Name: question, dtype: object
Then, use itertools
to get the combinations you want, computing the indices for each desired column:
import itertools
g = df.groupby('group')['question'].apply(set) # as above
rows = []
total = len(df.question.unique())
for a, b in itertools.combinations(g.index, 2):
rows.append({'Team1':a,
'Team2':b,
'NumberQuestionsT1':len(g[a]),
'NumberQuestionsT2':len(g[b]),
'AllQuestions': total,
'CommonQuestions': len(g[a] & g[b]),
'Ratio': len(g[a] & g[b]) / total})
output = pd.DataFrame(rows)
Giving:
Team1 Team2 NumberQuestionsT1 NumberQuestionsT2 AllQuestions \
0 g1 g2 3 1 5
1 g1 g3 3 3 5
2 g1 g4 3 3 5
3 g2 g3 1 3 5
4 g2 g4 1 3 5
5 g3 g4 3 3 5
CommonQuestions Ratio
0 1 0.2
1 2 0.4
2 1 0.2
3 1 0.2
4 1 0.2
5 2 0.4
I've replace the AllQuestions
calculation with the one you commented. Now, it seems like the difference is that your example input has 5 question types (abc
, def
, qqq
, www
, xyz
)?
I couldn't think of a way to do this without itertools
, and I am also using a for loop for the initial determination of every set
of questions. So I reckon there could be a significantly faster method, but this may be sufficient given the size of your data.