Home > Back-end >  Collapsing Survey Data in Pandas
Collapsing Survey Data in Pandas

Time:12-22

I have a dataset that looks like this:

Criteria Answer frequency
Criteria 1 Answer A arbitrary integer
Criteria 2 Answer B arbitrary integer

For each criterion, there is a range of answer options in the survey ranging from A to E, and there are 4 criteria. However, there are multiple instances of the criteria in the table.

I am trying to find out how I can collapse the information by criteria i.e. for each criteria, what is the spread of answers given by the respondents, in percentage form?

I have tried using groupby but to no avail.

CodePudding user response:

PLEASE NOTE I HAVE USED BY MISTAKE frequency float values and NOT integers.

sample data:

      Criteria    Answer  frequency
0   Criteria 1  Answer A        0.1
1   Criteria 1  Answer A        0.2
2   Criteria 1  Answer A        0.6
3   Criteria 1  Answer A        0.3
4   Criteria 1  Answer B        0.7
5   Criteria 1  Answer B        0.4
6   Criteria 1  Answer B        0.9
7   Criteria 2  Answer A        0.1
8   Criteria 2  Answer A        0.1
9   Criteria 2  Answer A        0.1
10  Criteria 2  Answer C        0.1
11  Criteria 2  Answer C        0.4
12  Criteria 2  Answer C        0.7

df.groupby(["Criteria", "Answer"]).apply(lambda x: x.min())

Output:

                     frequency
Criteria   Answer
Criteria 1 Answer A        0.5
           Answer B        0.5
Criteria 2 Answer A        0.0
           Answer C        0.6


df.groupby(["Criteria", "Answer"]).apply(lambda x: x.median())

Output

                     frequency
Criteria   Answer
Criteria 1 Answer A       0.25
           Answer B       0.70
Criteria 2 Answer A       0.10
           Answer C       0.40


df.groupby(["Criteria", "Answer"]).apply(lambda x: x.std())

Output

                        frequency
Criteria   Answer
Criteria 1 Answer A  2.160247e-01
           Answer B  2.516611e-01
Criteria 2 Answer A  1.699675e-17
           Answer C  3.000000e-01

if you do .reset_index():

df.groupby(["Criteria", "Answer"]).apply(lambda x: x.std()).reset_index()

Output:

     Criteria    Answer     frequency
0  Criteria 1  Answer A  2.160247e-01
1  Criteria 1  Answer B  2.516611e-01
2  Criteria 2  Answer A  1.699675e-17
3  Criteria 2  Answer C  3.000000e-0

CodePudding user response:

IIUC:

df = df.groupby(['Criteria', 'Answer']).size().reset_index(name='size')
df['frequency'] = df.groupby('Criteria')[['size']].apply(lambda x: x.div(x.sum()))
df.drop(columns=['size'], inplace=True)
print(df)

OUTPUT

     Criteria    Answer  frequency
0  Criteria 1  Answer A        0.6
1  Criteria 1  Answer B        0.4
2  Criteria 2  Answer A        0.4
3  Criteria 2  Answer B        0.4
4  Criteria 2  Answer C        0.2

SETUP

data = """
Criteria\tAnswer
Criteria 1\tAnswer A
Criteria 1\tAnswer A
Criteria 1\tAnswer A
Criteria 1\tAnswer B
Criteria 1\tAnswer B
Criteria 2\tAnswer A
Criteria 2\tAnswer A
Criteria 2\tAnswer B
Criteria 2\tAnswer B
Criteria 2\tAnswer C
"""

df = pd.read_csv(StringIO(data), sep='\t')
  • Related