Create columns that repeat a value from a column based on a matching value from another column-CodePudding

I am working with survey data where I need to create several columns from the value provided for some of the questions in the survey. Say there are two submissions to the survey and two questions in each submission that I want to make a column of based on their response, the answers to these questions should map to the submission id.

So if we have:

       submission_id question_id  answer
    0              1           a    Male
    1              1           b     Cat
    2              2           a  Female
    3              2           b     Dog

And I wanted to create columns to track the sex and favorite animal for each submission, I would expect a table like.

       submission_id question_id  answer     sex favorite_animal
    0              1           a    Male    Male             Cat
    1              1           b     Cat    Male             Cat
    2              2           a  Female  Female             Dog
    3              2           b     Dog  Female             Dog

I have gotten this to work using an np.where() when there are just a few submissions that I can enter manually, but there are more than 20 submissions in the data. Ex:

# Get all submission ids
submission_ids = df.submission_id.unique().tolist()
    
# Get all the animals 
pet_list = df[(df['submission_id'] == submission_ids[0]) & (df['question_id'] == 'b') | (df['submission_id'] == submission_ids[1]) & (df['question_id'] == 'b')].loc[:,'answer'].values

# Create new column of animals per submission id
df['favorite_animal'] = np.where(df.submission_id == submission_ids[0], pet_list[0], np.where(df.submission_id == submission_ids[1], pet_list[1], 'fail'))

In an attempt to scale this to the number of submissions in the data, the below lambda function also produced the generator in each row that I'm having trouble extracting and understanding if it will produce the desired result

df['favorite_animal'] = df[['question_id', 'submission_id', 'answer']].apply(lambda x: (x['answer'] if x['question_id'] == 'b' and x['submission_id'] == submission_id else 'fail' for submission_id in submission_ids), axis=1)

CodePudding user response：

Do the pivot then merge

out = df.merge(df.pivot(*df).reset_index().rename(columns={'a':'gender','b':'favorite_animal'}))
Out[125]: 
   submission_id question_id  answer  gender favorite_animal
0              1           a    Male    Male             Cat
1              1           b     Cat    Male             Cat
2              2           a  Female  Female             Dog
3              2           b     Dog  Female             Dog

CodePudding user response：

I was able to learn from what BENY provided in their answer to work for my use case. Here I have 5 question_ids that need to be created into new columns and tracked for each submission_id.

# Subset out questions to become columns
submission_ids = df.submission_id.unique().tolist()
question_ids = [1, 2, 3, 4, 5]
questions_df = df[df['question_id'].isin(question_ids)]
questions_df = questions_df[['submission_id', 'question_id', 'answer']]
questions_df.reset_index(drop=True, inplace=True)

# Create pivot table of questions
questions_df_pivot_table = questions_df.pivot_table(index='submission_id', columns='question_id', values='answer', aggfunc=lambda x: ''.join(x)).reset_index().rename(columns={1: 'col_1', 2: 'col_2', 3: 'col_3', 4: 'col_4', 5: 'col_5'})
questions_df_pivot_table.columns.name = None

# Merge back into main df
df = df.merge(questions_df_pivot_table)