I would like to restructure groupby data into dataframe. Because the table has different length of groupbyed data, when I iterate those groupbyed data, IndexError: list index out of range comes up and I do not know how to ignore this index error.
Background of my data.
- I groupbyed by "Worker_id" and "Project_id"
Question | Answer | Worker_ID | Project_ID |
---|---|---|---|
DD | AB | X | Y |
DD | AB | X | Y |
DD | AB | X | Y |
BD | BG | K | Y |
BD | BG | K | Y |
KY | GG | J | Y |
KY | GG | J | Y |
KY | GG | J | Y |
KY | GG | J | Y |
RR | FR | X | Q |
HU | RT | K | Q |
HU | RT | K | Q |
HU | RT | K | Q |
YU | GE | J | Q |
YU | GE | J | Q |
XX | FF | K | P |
XX | FF | K | P |
XI | UF | J | P |
XI | UF | J | P |
(This table goes on much longer)
I would like to create dataframe like this↓↓
Question_1 | Answer_1 | Worker_ID_1 | Question_2 | Answer_2 | Worker_ID_2 | Question_3 | Answer_3 | Worker_ID_3 | Project_ID |
---|---|---|---|---|---|---|---|---|---|
DD | AB | X | BD | BG | K | KY | GG | J | Y |
DD | AB | X | BD | BG | K | KY | GG | J | Y |
DD | AB | X | Blank | Blank | Blank | KY | GG | J | Y |
Blank | Blank | Blank | Blank | Blank | Blank | KY | GG | J | Y |
RR | FR | X | HU | RT | K | YU | GE | J | Q |
Blank | Blank | Blank | HU | RT | K | YU | GE | J | Q |
Blank | Blank | Blank | HU | RT | K | Blank | Blank | Blank | Q |
Blank | Blank | Blank | XX | FF | K | XI | UF | J | P |
Blank | Blank | Blank | XX | FF | K | XI | UF | J | P |
(This table goes on much longer)
- Because each groupbyed data has different length, the index error comes up when I try with the longest list in my code.
My code↓↓
#groupby by "Worker_ID","Project_ID"
grouped_questions = {}
for x, y in df1.groupby(["Worker_ID","Project_ID"],as_index=True):
grouped_questions[x] = y.reset_index(drop = True)
#create lists of keys
unique_list = []
for unique in list(grouped_questions):
if unique not in unique_list:
unique_list.append(unique)
question_1 = []
question_2 = []
question_3 = []
for qn in unique_list:
worker = qn[2]
project = qn[3]
if worker == 'X':
question_1.append(qn)
elif worker == 'K':
question_2.append(qn)
elif worker == 'J':
question_3.append(qn)
#combine into dataframe
final_df1 = pd.DataFrame()
for index in range(1,max_length):
one_question = grouped_questions[question_1[index]]
two_question = grouped_questions[question_2[index]]
three_question = grouped_questions[question_3[index]]
merged_df1 = one_question.merge(two_question, how='outer', left_index=True, right_index=True, suffixes=["_1", "_2"])
merged_df2 = three_question.merge(four_question, how='outer', left_index=True, right_index=True, suffixes=["_3", "_4"])
if final_df1.shape[0] == 0:
final_df1 = merged_df5
else:
final_df1 = pd.concat([final_df1, merged_df3],ignore_index=True)
final_df1.reset_index(drop=True)
final_df1
CodePudding user response:
Here is what I have so far. As I said in the comments, there are more rows than in your desired output, but I don't know on what conditions I should join them.
# restructure the df with `pivot_table`
m1 = df.groupby('Project_ID').cumcount() 1
m2 = pd.factorize(df['Worker_ID'])[0] 1
out = (
df
.pivot_table(
index=['Project_ID', m1],
columns=m2,
values=['Worker_ID', 'Question', 'Answer'],
aggfunc='first'
)
)
# join the multiindex columns to single columns
out.columns = out.columns.map(lambda x: f"{x[0]}_{x[1]}")
# in case you need to automate the sorting of the columns, here is how to do it (otherwise you could just pass a list with the new order of the columns)
# 1st sort by numbers, then always bring them in the right order (question, answer, worker)
pattern = '|'.join([f"({x})" for x in ['Question', 'Answer', 'Worker']])
def sort_key(x):
return re.search(pattern,x).lastindex
new_order = sorted(out.columns.tolist(), key=lambda x: (int(x[-1]), sort_key(x.split('_')[0])))
out = out[new_order]
print(out)
Question_1 Answer_1 Worker_ID_1 Question_2 Answer_2 Worker_ID_2 Question_3 Answer_3 Worker_ID_3
Project_ID
P 1 NaN NaN NaN XX FF K NaN NaN NaN
2 NaN NaN NaN XX FF K NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN XI UF J
4 NaN NaN NaN NaN NaN NaN XI UF J
Q 1 RR FR X NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN HU RT K NaN NaN NaN
3 NaN NaN NaN HU RT K NaN NaN NaN
4 NaN NaN NaN HU RT K NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN YU GE J
6 NaN NaN NaN NaN NaN NaN YU GE J
Y 1 DD AB X NaN NaN NaN NaN NaN NaN
2 DD AB X NaN NaN NaN NaN NaN NaN
3 DD AB X NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN BD BG K NaN NaN NaN
5 NaN NaN NaN BD BG K NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN KY GG J
7 NaN NaN NaN NaN NaN NaN KY GG J
8 NaN NaN NaN NaN NaN NaN KY GG J
9 NaN NaN NaN NaN NaN NaN KY GG J