I have a dataset that looks like this:
Criteria | Answer | frequency |
---|---|---|
Criteria 1 | Answer A | arbitrary integer |
Criteria 2 | Answer B | arbitrary integer |
For each criterion, there is a range of answer options in the survey ranging from A to E, and there are 4 criteria. However, there are multiple instances of the criteria in the table.
I am trying to find out how I can collapse the information by criteria i.e. for each criteria, what is the spread of answers given by the respondents, in percentage form?
I have tried using groupby but to no avail.
CodePudding user response:
PLEASE NOTE I HAVE USED BY MISTAKE frequency float values and NOT integers.
sample data:
Criteria Answer frequency
0 Criteria 1 Answer A 0.1
1 Criteria 1 Answer A 0.2
2 Criteria 1 Answer A 0.6
3 Criteria 1 Answer A 0.3
4 Criteria 1 Answer B 0.7
5 Criteria 1 Answer B 0.4
6 Criteria 1 Answer B 0.9
7 Criteria 2 Answer A 0.1
8 Criteria 2 Answer A 0.1
9 Criteria 2 Answer A 0.1
10 Criteria 2 Answer C 0.1
11 Criteria 2 Answer C 0.4
12 Criteria 2 Answer C 0.7
df.groupby(["Criteria", "Answer"]).apply(lambda x: x.min())
Output:
frequency
Criteria Answer
Criteria 1 Answer A 0.5
Answer B 0.5
Criteria 2 Answer A 0.0
Answer C 0.6
df.groupby(["Criteria", "Answer"]).apply(lambda x: x.median())
Output
frequency
Criteria Answer
Criteria 1 Answer A 0.25
Answer B 0.70
Criteria 2 Answer A 0.10
Answer C 0.40
df.groupby(["Criteria", "Answer"]).apply(lambda x: x.std())
Output
frequency
Criteria Answer
Criteria 1 Answer A 2.160247e-01
Answer B 2.516611e-01
Criteria 2 Answer A 1.699675e-17
Answer C 3.000000e-01
if you do .reset_index():
df.groupby(["Criteria", "Answer"]).apply(lambda x: x.std()).reset_index()
Output:
Criteria Answer frequency
0 Criteria 1 Answer A 2.160247e-01
1 Criteria 1 Answer B 2.516611e-01
2 Criteria 2 Answer A 1.699675e-17
3 Criteria 2 Answer C 3.000000e-0
CodePudding user response:
IIUC:
df = df.groupby(['Criteria', 'Answer']).size().reset_index(name='size')
df['frequency'] = df.groupby('Criteria')[['size']].apply(lambda x: x.div(x.sum()))
df.drop(columns=['size'], inplace=True)
print(df)
OUTPUT
Criteria Answer frequency
0 Criteria 1 Answer A 0.6
1 Criteria 1 Answer B 0.4
2 Criteria 2 Answer A 0.4
3 Criteria 2 Answer B 0.4
4 Criteria 2 Answer C 0.2
SETUP
data = """
Criteria\tAnswer
Criteria 1\tAnswer A
Criteria 1\tAnswer A
Criteria 1\tAnswer A
Criteria 1\tAnswer B
Criteria 1\tAnswer B
Criteria 2\tAnswer A
Criteria 2\tAnswer A
Criteria 2\tAnswer B
Criteria 2\tAnswer B
Criteria 2\tAnswer C
"""
df = pd.read_csv(StringIO(data), sep='\t')