I want to create new columns based on a given column value and merge the rows with same subject names within a given date, for example for a given pandas dataframe
subject | questionid | answer | datetime |
---|---|---|---|
First | q1 | fa1 | 2021-02-27T |
First | q2 | fa2 | 2021-02-27T |
First | q3 | fa3 | 2021-02-27T |
First | q4 | fa4 | 2021-02-28T |
Second | q1 | sa1 | 2021-02-27T |
Second | q2 | 2021-02-27T | |
Second | q3 | sa3 | 2021-02-27T |
Second | q4 | 2021-02-27T |
I want my final dataframe to contain columns based on questionid
where each row contains data in each date range i.e. within 24 hours of datetime
for a single subject.
**
So, the final output should look like this,
**
subject | q1 | q2 | q3 | q4 | datetime |
---|---|---|---|---|---|
First | fa1 | fa2 | fa3 | nan | 2021-02-27T |
First | nan | nan | nan | fa4 | 2021-02-28T |
Second | sa1 | nan | sa3 | nan | 2021-02-27T |
Notice that the first
subject has two rows because of different dates the data were collected on.
I was able to create columns based on the questionid
in a new dataframe by using
df.pivot(columns='questionid', values='answer')
But the final dataframe looks like this,
subject | questionid | answer | datetime | q1 | q2 | q3 | q4 |
---|---|---|---|---|---|---|---|
First | q1 | fa1 | 2021-02-27T | fa1 | nan | nan | nan |
First | q2 | fa2 | 2021-02-27T | nan | fa2 | nan | nan |
First | q3 | fa3 | 2021-02-27T | nan | nan | fa3 | nan |
First | q4 | fa4 | 2021-02-28T | nan | nan | nan | fa4 |
Second | q1 | sa1 | 2021-02-27T | sa1 | nan | nan | nan |
Second | q2 | 2021-02-27T | nan | nan | nan | nan | |
Second | q3 | sa3 | 2021-02-27T | nan | nan | sa3 | nan |
Second | q4 | 2021-02-27T | nan | nan | nan | nan |
Is there an easier way without looping through the rows to create the desired output. Again
CodePudding user response:
You're almost there, use both ['subject', 'datetime']
as index in pivot
:
(df.pivot(index=['subject', 'datetime'], columns='questionid', values='answer')
.reset_index()
)
output:
subject datetime q1 q2 q3 q4
0 First 2021-02-27T fa1 fa2 fa3 NaN
1 First 2021-02-28T NaN NaN NaN fa4
2 Second 2021-02-27T sa1 NaN sa3 NaN
CodePudding user response:
I think that should do it:
pd.pivot_table(index=['subject', 'datetime'], values='answer', columns='questionid')