Home > Blockchain >  How to transform multiple choice questions into readable data with pandas in python?
How to transform multiple choice questions into readable data with pandas in python?

Time:11-21

I have survey data that I am trying to clean up into something more useable for analysis.

I have been able to find decent guides on most of the problems, but one - tacking multiple choice questions which are passed into a single column.

What's the best practice to split it into separate columns?

I normally would do this in SQL and I would make a CASE WHEN check for each of the different options, creating as many columns as there are options.

This is an example table of what I am working with:


 ---- ---------------------------- 
| ID | When did you attend?       |
 ---- ---------------------------- 
| 1  | Monday\nTuesday            |
 ---- ---------------------------- 
| 2  | Monday\nTuesday\nWednesday |
 ---- ---------------------------- 
| 3  | Monday\nFriday\nSaturday   |
 ---- ---------------------------- 

This is what I think would make sense to transform into:

 ---- ---------------------------- -------- --------- ----------- ---------- -------- ---------- -------- 
| ID | When did you attend?       | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
 ---- ---------------------------- -------- --------- ----------- ---------- -------- ---------- -------- 
| 1  | Monday\nTuesday            | True   | True    | False     | False    | False  | False    | False  |
 ---- ---------------------------- -------- --------- ----------- ---------- -------- ---------- -------- 
| 2  | Monday\nTuesday\nWednesday | True   | True    | True      | False    | False  | False    | False  |
 ---- ---------------------------- -------- --------- ----------- ---------- -------- ---------- -------- 
| 3  | Monday\nFriday\nSaturday   | True   | False   | False     | False    | True   | True     | False  |
 ---- ---------------------------- -------- --------- ----------- ---------- -------- ---------- -------- 

I have figured out so far how to pass the parts of the strings in the second column into a list using df.iloc[:,1].str.split('\n') , but I haven't figured out the next steps (even if that one is necessary).

To me it seems that would make it easier to segment the rest of the data based on what days were attended.

CodePudding user response:

We can use str.get_dummies to split the strings around the separator \n and create a dataframe of indicator variables, then reindex this dataframe along the columns axis and change the dtype to boolean, finally join the given dataframe with the indicator dataframe to get the result

import calendar

df.join(
    df['When did you attend?'].str.get_dummies(sep='\n')
    .reindex(calendar.day_name, axis=1, fill_value=0).astype(bool)
)

   ID        When did you attend?  Monday  Tuesday  Wednesday  Thursday  Friday  Saturday  Sunday
0   1             Monday\nTuesday    True     True      False     False   False     False   False
1   2  Monday\nTuesday\nWednesday    True     True       True     False   False     False   False
2   3    Monday\nFriday\nSaturday    True    False      False     False    True      True   False

CodePudding user response:

After turning the strings into lists, you can explode the column of lists to get a separate row for each day. Then, you can pivot the table to make it a column for each day.

Using this dataframe:

df = pd.DataFrame({
    'ID': [1,2,3],
    'When did you attend?': ['Monday\nTuesday', 'Monday\nTuesday\nWednesday', 'Monday\nFriday\nSaturday']
})

Split the strings:

df['When did you attend?'] = df['When did you attend?'].str.split('\n')
   ID          When did you attend?
0   1             [Monday, Tuesday]
1   2  [Monday, Tuesday, Wednesday]
2   3    [Monday, Friday, Saturday]

Explode, creating a row for each ID, day attended:

df = df.explode('When did you attend?')
   ID When did you attend?
0   1               Monday
0   1              Tuesday
1   2               Monday
1   2              Tuesday
1   2            Wednesday
...

Pivot to put the days as columns:

df = df.pivot(index='ID', columns='When did you attend?', values='When did you attend?')
When did you attend?  Friday  Monday  Saturday  Tuesday  Wednesday
ID                                                                
1                        NaN  Monday       NaN  Tuesday        NaN
2                        NaN  Monday       NaN  Tuesday  Wednesday
3                     Friday  Monday  Saturday      NaN        NaN

Finally, reorder the columns and convert the values to booleans:

df = df[['Monday', 'Tuesday', 'Wednesday', 'Friday', 'Saturday']]
df = df.fillna(0).astype('bool')
When did you attend?  Monday  Tuesday  Wednesday  Friday  Saturday
ID                                                                
1                       True     True      False   False     False
2                       True     True       True   False     False
3                       True    False      False    True      True

Depending on what you're doing later, just exploding without pivoting might be a more useful structure.

Note that days that were not in the data are missing from the columns. You may want to add columns for the missing days, unless your data contains every day anyways.

CodePudding user response:

df.join(df["When did you attend?"].str.split("\n").explode().rename("Day")).assign(
    exist=True
).pivot(index=["ID", "When did you attend?"], columns="Day").droplevel(0, 1).reindex(
    ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"],
    axis=1,
).fillna(
    False
).reset_index().rename_axis(
    columns=None
)

Breaking it down:

  1. Explode the column after splitting it
tmp = df["When did you attend?"].str.split("\n").explode().rename("Day")
tmp
>>>
0       Monday
0      Tuesday
1       Monday
1      Tuesday
1    Wednesday
2       Monday
2       Friday
2     Saturday
Name: Day, dtype: object
  1. Join back to original df and create an exist column of True's (used as the value column in the pivot later)
df.join(tmp).assign(exist=True)
   ID        When did you attend?        Day  exist
0   1             Monday\nTuesday     Monday   True
0   1             Monday\nTuesday    Tuesday   True
1   2  Monday\nTuesday\nWednesday     Monday   True
1   2  Monday\nTuesday\nWednesday    Tuesday   True
1   2  Monday\nTuesday\nWednesday  Wednesday   True
2   3    Monday\nFriday\nSaturday     Monday   True
2   3    Monday\nFriday\nSaturday     Friday   True
2   3    Monday\nFriday\nSaturday   Saturday   True
  1. Pivot cleanups (drop column levels, re-index all days of week to account for missing days in the dataset, then fillna with False)
   ID        When did you attend?  Monday  Tuesday  Wednesday  Thursday  Friday  Saturday  Sunday
0   1             Monday\nTuesday    True     True      False     False   False     False   False
1   2  Monday\nTuesday\nWednesday    True     True       True     False   False     False   False
2   3    Monday\nFriday\nSaturday    True    False      False     False    True      True   False
  • Related