One dataframe looks like this (it stems from a bigger one that was sliced by workplace, hence the square brackets):
company_grouper[0] =
group workplace team employee answer question
a w1 t1 smith True q1
a w1 t1 smith False q2
a w1 t1 smith True q2
a w1 t1 john False q1
a w1 t2 joe True q2
b w1 t1 don True q1
b w1 t1 don False q2
b w1 t2 sean True q3
c w1 t2 sean True q5
c w1 t3 liam False q5
c w1 t1 al True q1
So workplace is always the same, team doesn't matter, an employee can be in multiple groups and can answer the same question multiple times. I wanted to compute some statistics and compare the groups two by two, because not all of them deal with the same questions. So firstly:
import itertools
import pandas as pd

# Set of distinct questions answered within each group
g_8 = company_grouper[0].groupby('group')['question'].apply(set)
rows = []
# Every unordered pair of groups, each pair exactly once
for a, b in itertools.combinations(g_8.index, 2):
    rows.append({'Group1': a,
                 'Group2': b,
                 'NumberQuestionsG1': len(g_8[a]),
                 'NumberQuestionsG2': len(g_8[b]),
                 'Q_G1_G2': len(list(set().union(g_8[a], g_8[b]))),
                 'AllQuestions': len(company_grouper[0].question.unique()),
                 'CommonQuestions': len(g_8[a] & g_8[b]),
                 'Ratio': len(g_8[a] & g_8[b]) / (len(company_grouper[0].question.unique())),
                 'Ratio_pair': len(g_8[a] & g_8[b]) / len(list(set().union(g_8[a], g_8[b])))})
output_g_8 = pd.DataFrame(rows)
The columns are unimportant for this post. The only thing that matters is that I take the groups two by two, without repetitions. The above code works.
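To make "two by two, without repetitions" concrete, here is a minimal sketch of what itertools.combinations yields. The `questions` dictionary is a hypothetical stand-in for g_8, filled with the question sets implied by the sample data in this post.

```python
import itertools

# Hypothetical stand-in for g_8: the distinct questions per group
questions = {
    'a': {'q1', 'q2'},
    'b': {'q1', 'q2', 'q3'},
    'c': {'q1', 'q5'},
}

# Each unordered pair appears once: (a, b) is produced, (b, a) is not
pairs = list(itertools.combinations(sorted(questions), 2))

# Questions shared by both groups of each pair
common = {(a, b): questions[a] & questions[b] for a, b in pairs}

for pair, shared in common.items():
    print(pair, sorted(shared))
```

With three groups this gives three pairs, so every later per-pair statistic should have at most three rows.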
The problem is when I am trying to compute the averages for each group within each pair:
d_groups = {'Group1': 'group1', 'Group2': 'group2'}
result_8_partial = (company_grouper[0].merge(company_grouper[0], on='question', suffixes=('1', '2'))
                    .query('group1 != group2')
                    .groupby(['group1', 'group2', 'question'], as_index=False)
                    .mean())
statistic_8 = result_8_partial.merge(output_g_8[['Group1', 'Group2']].rename(columns=d_groups))
statistic_8_averages = statistic_8.groupby(
    ['group1', 'group2'], as_index=False
).agg(Average1=('answer1', 'mean'), Average2=('answer2', 'mean'))
I don't understand why everything that I wrote here (see the sample data below) works, but fails when I run it on the slice company_grouper[8]. I get UndefinedVariableError: name 'group1' is not defined.
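One way to narrow this down is to inspect the merged columns before calling query. The sketch below is not a confirmed diagnosis: it only shows that merge applies suffixes to column names present in both frames, so group1 and group2 exist only when both sides carry a group column. The small `df` and the renamed `other` frame are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of one workplace slice
df = pd.DataFrame({'group': ['a', 'a', 'b', 'c'],
                   'answer': [True, False, True, False],
                   'question': ['q1', 'q2', 'q1', 'q1']})

# Self-merge: 'group' and 'answer' overlap, so they receive the suffixes
merged = df.merge(df, on='question', suffixes=('1', '2'))
print(merged.columns.tolist())

# Only query on group1/group2 if the merge actually produced them
if {'group1', 'group2'}.issubset(merged.columns):
    pairs = merged.query('group1 != group2')

# Assumed failure mode: the right-hand frame has no column named 'group',
# so nothing overlaps on that name and no 'group1' column is created
other = df.rename(columns={'group': 'grp'})
merged_no_suffix = df.merge(other, on='question', suffixes=('1', '2'))
print(merged_no_suffix.columns.tolist())
```

If the columns printed for the failing slice lack group1, the frame being merged is probably not shaped like the sample shown here.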
Here's the data to play around with:
company_grouper = pd.DataFrame({'group': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                                'workplace': ['w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1'],
                                'team': ['t1', 't1', 't1', 't1', 't2', 't1', 't1', 't2', 't2', 't3', 't1'],
                                'employee': ['smith', 'smith', 'smith', 'john', 'joe', 'don', 'don', 'sean', 'sean', 'liam', 'al'],
                                'answer': [True, False, True, False, True, True, False, True, True, False, True],
                                'question': ['q1', 'q2', 'q2', 'q1', 'q2', 'q1', 'q2', 'q3', 'q5', 'q5', 'q1']})
EDIT: How I got to the company_grouper[0] dataframe:
df_big=
group workplace team employee answer question
a w1 t1 smith True q1
a w1 t1 smith False q2
a w1 t1 smith True q2
a w1 t1 john False q1
a w1 t2 joe True q2
b w1 t1 don True q1
b w1 t1 don False q2
b w1 t2 sean True q3
c w1 t2 sean True q5
c w1 t3 liam False q5
c w1 t1 al True q1
z w2 t9 mary True q7
z w2 t9 mary False q8
y w2 t9 dan False q7
y w2 t8 ben True q9
w w3 t14 greg False q15
And then:
company_grouper = [g for _, g in df_big.groupby(['workplace'])]
CodePudding user response:
I'm missing something here: you are trying to access company_grouper by 8, but isn't that a DataFrame?
Can you post your original company_grouper? I'd expect it to be a dictionary or a list, not a DataFrame.
Be careful about using the same name for different things.
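A small sketch of why the name clash matters, using a hypothetical miniature of df_big and a separate name (`frames_by_workplace`) for the list: an integer in square brackets is a positional index on a list, but a column lookup on a DataFrame.

```python
import pandas as pd

# Hypothetical miniature of df_big
df_big = pd.DataFrame({'group': ['a', 'b', 'z', 'w'],
                       'workplace': ['w1', 'w1', 'w2', 'w3'],
                       'question': ['q1', 'q1', 'q7', 'q15']})

# A list of per-workplace frames: [0] is the first frame in the list
frames_by_workplace = [g for _, g in df_big.groupby('workplace')]
first = frames_by_workplace[0]

# On the DataFrame itself, [0] looks for a column labelled 0
try:
    df_big[0]
    raised = False
except KeyError:
    raised = True

print(len(frames_by_workplace), raised)
```

Keeping the list and the sample DataFrame under different names makes it obvious which kind of indexing each line performs.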