My question may look like a duplicate, since I found several other questions about the same error:
Pandas: grouping a column on a value and creating new column headings
Python/Pandas - ValueError: Index contains duplicate entries, cannot reshape
Pandas pivot produces "ValueError: Index contains duplicate entries, cannot reshape"
I tried all the solutions presented in those posts, but none worked. I believe the error may be caused by my dataset's format, which contains strings instead of numbers. Here is an example of my dataset:
protocol_no | activity | description |
---|---|---|
1586212 | walk | twice a day |
1586212 | drive | 5 km |
1586212 | sleep | NaN |
1586212 | eat | 1500 calories |
2547852 | walk | NaN |
2547852 | drive | NaN |
2547852 | eat | 3200 calories |
2547852 | sleep | At least 10 hours |
The output I'm trying to achieve is:
protocol_no | walk | drive | sleep | eat |
---|---|---|---|---|
1586212 | twice a day | 5 km | NaN | 1500 calories |
2547852 | NaN | NaN | At least 10 hours | 3200 calories |
I tried using pivot and pivot_table with code like this:
df.pivot(index="protocol_no", columns="activity", values="description")
But I'm still getting this error:
ValueError: Index contains duplicate entries, cannot reshape
I have no idea what is going wrong, so any help would be appreciated!
CodePudding user response:
Try using .pivot_table() with aggfunc='first' (or something similar) if you get a duplicate-index error when using .pivot():
df.pivot_table(index="protocol_no", columns="activity", values="description", aggfunc='first')
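This error commonly occurs when the same combination of index and columns values (here protocol_no and activity) appears in more than one row, so .pivot() cannot decide which value belongs in that cell. Using aggfunc='first' (or, for numeric values, sometimes aggfunc='sum') will most likely solve the problem.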
Result:
activity     drive            eat              sleep         walk
protocol_no
1586212       5 km  1500 calories                NaN  twice a day
2547852        NaN  3200 calories  At least 10 hours          NaN
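To check whether duplicated pairs are really the cause, it can help to list them before pivoting. The snippet below is only a sketch: the sample frame is hypothetical and deliberately repeats one (protocol_no, activity) pair, which is an assumption about what the real dataset contains. It shows the offending rows, confirms that .pivot() fails on them, and then applies the .pivot_table() call from above.

```python
import pandas as pd

# Hypothetical sample in the question's layout. The repeated
# (1586212, "walk") pair is an assumption about the real data.
df = pd.DataFrame({
    "protocol_no": [1586212, 1586212, 1586212, 2547852, 2547852],
    "activity": ["walk", "walk", "drive", "eat", "sleep"],
    "description": ["twice a day", "once a week", "5 km",
                    "3200 calories", "At least 10 hours"],
})

# Rows whose (protocol_no, activity) pair occurs more than once.
dupes = df[df.duplicated(subset=["protocol_no", "activity"], keep=False)]
print(dupes)

# pivot() cannot choose between the duplicated rows and raises ValueError.
try:
    df.pivot(index="protocol_no", columns="activity", values="description")
    pivot_failed = False
except ValueError as exc:
    pivot_failed = True
    print(exc)

# pivot_table() resolves each duplicated cell with the aggregation function;
# 'first' keeps the first non-null description found for that cell.
wide = df.pivot_table(index="protocol_no", columns="activity",
                      values="description", aggfunc="first")
print(wide)
```

Note that aggfunc='first' silently discards the other descriptions for a duplicated pair, so it is worth inspecting dupes first to decide whether that is acceptable.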