I have survey data that I am trying to clean up into something more useable for analysis.
I have been able to find decent guides on most of the problems, but one - tacking multiple choice questions which are passed into a single column.
What's the best practice to split it into separate columns?
I normally would do this in SQL and I would make a CASE WHEN check for each of the different options, creating as many columns as there are options.
This is an example table of what I am working with:
---- ----------------------------
| ID | When did you attend? |
---- ----------------------------
| 1 | Monday\nTuesday |
---- ----------------------------
| 2 | Monday\nTuesday\nWednesday |
---- ----------------------------
| 3 | Monday\nFriday\nSaturday |
---- ----------------------------
This is what I think would make sense to transform into:
---- ---------------------------- -------- --------- ----------- ---------- -------- ---------- --------
| ID | When did you attend? | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
---- ---------------------------- -------- --------- ----------- ---------- -------- ---------- --------
| 1 | Monday\nTuesday | True | True | False | False | False | False | False |
---- ---------------------------- -------- --------- ----------- ---------- -------- ---------- --------
| 2 | Monday\nTuesday\nWednesday | True | True | True | False | False | False | False |
---- ---------------------------- -------- --------- ----------- ---------- -------- ---------- --------
| 3 | Monday\nFriday\nSaturday | True | False | False | False | True | True | False |
---- ---------------------------- -------- --------- ----------- ---------- -------- ---------- --------
I have figured out so far how to pass the parts of the strings in the second column into a list using df.iloc[:,1].str.split('\n')
, but I haven't figured out the next steps (even if that one is necessary).
To me it seems that would make it easier to segment the rest of the data based on what days were attended.
CodePudding user response:
We can use str.get_dummies
to split the strings around the separator \n
and create a dataframe of indicator variables, then reindex
this dataframe along the columns axis and change the dtype
to boolean, finally join
the given dataframe with the indicator dataframe to get the result
import calendar
df.join(
df['When did you attend?'].str.get_dummies(sep='\n')
.reindex(calendar.day_name, axis=1, fill_value=0).astype(bool)
)
ID When did you attend? Monday Tuesday Wednesday Thursday Friday Saturday Sunday
0 1 Monday\nTuesday True True False False False False False
1 2 Monday\nTuesday\nWednesday True True True False False False False
2 3 Monday\nFriday\nSaturday True False False False True True False
CodePudding user response:
After turning the strings into lists, you can explode the column of lists to get a separate row for each day. Then, you can pivot the table to make it a column for each day.
Using this dataframe:
df = pd.DataFrame({
'ID': [1,2,3],
'When did you attend?': ['Monday\nTuesday', 'Monday\nTuesday\nWednesday', 'Monday\nFriday\nSaturday']
})
Split the strings:
df['When did you attend?'] = df['When did you attend?'].str.split('\n')
ID When did you attend?
0 1 [Monday, Tuesday]
1 2 [Monday, Tuesday, Wednesday]
2 3 [Monday, Friday, Saturday]
Explode, creating a row for each ID, day attended:
df = df.explode('When did you attend?')
ID When did you attend?
0 1 Monday
0 1 Tuesday
1 2 Monday
1 2 Tuesday
1 2 Wednesday
...
Pivot to put the days as columns:
df = df.pivot(index='ID', columns='When did you attend?', values='When did you attend?')
When did you attend? Friday Monday Saturday Tuesday Wednesday
ID
1 NaN Monday NaN Tuesday NaN
2 NaN Monday NaN Tuesday Wednesday
3 Friday Monday Saturday NaN NaN
Finally, reorder the columns and convert the values to booleans:
df = df[['Monday', 'Tuesday', 'Wednesday', 'Friday', 'Saturday']]
df = df.fillna(0).astype('bool')
When did you attend? Monday Tuesday Wednesday Friday Saturday
ID
1 True True False False False
2 True True True False False
3 True False False True True
Depending on what you're doing later, just exploding without pivoting might be a more useful structure.
Note that days that were not in the data are missing from the columns. You may want to add columns for the missing days, unless your data contains every day anyways.
CodePudding user response:
df.join(df["When did you attend?"].str.split("\n").explode().rename("Day")).assign(
exist=True
).pivot(index=["ID", "When did you attend?"], columns="Day").droplevel(0, 1).reindex(
["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"],
axis=1,
).fillna(
False
).reset_index().rename_axis(
columns=None
)
Breaking it down:
- Explode the column after splitting it
tmp = df["When did you attend?"].str.split("\n").explode().rename("Day")
tmp
>>>
0 Monday
0 Tuesday
1 Monday
1 Tuesday
1 Wednesday
2 Monday
2 Friday
2 Saturday
Name: Day, dtype: object
- Join back to original df and create an exist column of True's (used as the value column in the pivot later)
df.join(tmp).assign(exist=True)
ID When did you attend? Day exist
0 1 Monday\nTuesday Monday True
0 1 Monday\nTuesday Tuesday True
1 2 Monday\nTuesday\nWednesday Monday True
1 2 Monday\nTuesday\nWednesday Tuesday True
1 2 Monday\nTuesday\nWednesday Wednesday True
2 3 Monday\nFriday\nSaturday Monday True
2 3 Monday\nFriday\nSaturday Friday True
2 3 Monday\nFriday\nSaturday Saturday True
- Pivot cleanups (drop column levels, re-index all days of week to account for missing days in the dataset, then fillna with False)
ID When did you attend? Monday Tuesday Wednesday Thursday Friday Saturday Sunday
0 1 Monday\nTuesday True True False False False False False
1 2 Monday\nTuesday\nWednesday True True True False False False False
2 3 Monday\nFriday\nSaturday True False False False True True False