I am working with survey data where I need to create several columns from the value provided for some of the questions in the survey. Say there are two submissions to the survey and two questions in each submission that I want to make a column of based on their response, the answers to these questions should map to the submission id.
So if we have:
submission_id question_id answer
0 1 a Male
1 1 b Cat
2 2 a Female
3 2 b Dog
And I wanted to create columns to track the sex and favorite animal for each submission, I would expect a table like.
submission_id question_id answer sex favorite_animal
0 1 a Male Male Cat
1 1 b Cat Male Cat
2 2 a Female Female Dog
3 2 b Dog Female Dog
I have gotten this to work using an np.where()
when there are just a few submissions that I can enter manually, but there are more than 20 submissions in the data. Ex:
# Get all submission ids
submission_ids = df.submission_id.unique().tolist()
# Get all the animals
pet_list = df[(df['submission_id'] == submission_ids[0]) & (df['question_id'] == 'b') | (df['submission_id'] == submission_ids[1]) & (df['question_id'] == 'b')].loc[:,'answer'].values
# Create new column of animals per submission id
df['favorite_animal'] = np.where(df.submission_id == submission_ids[0], pet_list[0], np.where(df.submission_id == submission_ids[1], pet_list[1], 'fail'))
In an attempt to scale this to the number of submissions in the data, the below lambda function also produced the generator in each row that I'm having trouble extracting and understanding if it will produce the desired result
df['favorite_animal'] = df[['question_id', 'submission_id', 'answer']].apply(lambda x: (x['answer'] if x['question_id'] == 'b' and x['submission_id'] == submission_id else 'fail' for submission_id in submission_ids), axis=1)
CodePudding user response:
Do the pivot
then merge
out = df.merge(df.pivot(*df).reset_index().rename(columns={'a':'gender','b':'favorite_animal'}))
Out[125]:
submission_id question_id answer gender favorite_animal
0 1 a Male Male Cat
1 1 b Cat Male Cat
2 2 a Female Female Dog
3 2 b Dog Female Dog
CodePudding user response:
I was able to learn from what BENY provided in their answer to work for my use case. Here I have 5 question_ids that need to be created into new columns and tracked for each submission_id.
# Subset out questions to become columns
submission_ids = df.submission_id.unique().tolist()
question_ids = [1, 2, 3, 4, 5]
questions_df = df[df['question_id'].isin(question_ids)]
questions_df = questions_df[['submission_id', 'question_id', 'answer']]
questions_df.reset_index(drop=True, inplace=True)
# Create pivot table of questions
questions_df_pivot_table = questions_df.pivot_table(index='submission_id', columns='question_id', values='answer', aggfunc=lambda x: ''.join(x)).reset_index().rename(columns={1: 'col_1', 2: 'col_2', 3: 'col_3', 4: 'col_4', 5: 'col_5'})
questions_df_pivot_table.columns.name = None
# Merge back into main df
df = df.merge(questions_df_pivot_table)