Context: I'm trying to pivot a long format dataframe to a wide format dataframe, however, I'm noticing a weird pattern on the wide format dataframe. It seems that if we have repeated values for the index (in my case, a date), it's almost like it's giving me an average instead of repeating each index value and keeping the original values?
Here's a minimal reproducible example:
import datetime
import pandas as pd
long_dataframe = pd.DataFrame({"Date": [
datetime.datetime.strptime("01-01-2020", '%m-%d-%Y').date(),
datetime.datetime.strptime("01-01-2020", '%m-%d-%Y').date(),
datetime.datetime.strptime("01-02-2020", '%m-%d-%Y').date(),
datetime.datetime.strptime("01-02-2020", '%m-%d-%Y').date(),
datetime.datetime.strptime("01-03-2020", '%m-%d-%Y').date(),
datetime.datetime.strptime("01-04-2020", '%m-%d-%Y').date(),
datetime.datetime.strptime("01-04-2020", '%m-%d-%Y').date(),
datetime.datetime.strptime("01-01-2020", '%m-%d-%Y').date(),
datetime.datetime.strptime("01-01-2020", '%m-%d-%Y').date(),
datetime.datetime.strptime("01-02-2020", '%m-%d-%Y').date(),
datetime.datetime.strptime("01-02-2020", '%m-%d-%Y').date(),
datetime.datetime.strptime("01-03-2020", '%m-%d-%Y').date(),
datetime.datetime.strptime("01-04-2020", '%m-%d-%Y').date(),
datetime.datetime.strptime("01-04-2020", '%m-%d-%Y').date()
], "A": [
"category_X", "category_X", "category_X", "category_X", "category_X", "category_X", "category_X",
"category_Y", "category_Y", "category_Y", "category_Y", "category_Y", "category_Y", "category_Y"], "Values": [30, 40, 20, 30, 40, 50, 60,25,30,42,54,21,23,30]})
wide_dataframe = long_dataframe.reset_index().pivot_table(
index="Date", columns="A", values="Values")
wide_dataframe
Which gives me this:
A category_X category_Y
Date
2020-01-01 35.0 27.5
2020-01-02 25.0 48.0
2020-01-03 40.0 21.0
2020-01-04 55.0 26.5
How can I make it so that I see the repeated dates with their original values? Why is it that for 2020-01-01 its giving the value in between this date (30 and 40)?
Desired output would look something like this:
A category_X category_Y
Date
2020-01-01 30 ...
2020-01-01 40
2020-01-02 20
2020-01-02 30
2020-01-03 40
2020-01-04 50
2020-01-04 60
How can I do this while keeping duplicated indices?
I was thinking of giving each row a unique ID, but I'd really like to do this directly using the dates if possible (without creting any additional IDs)
Thank you!
CodePudding user response:
pivot_table
automatically aggregates. Since you cannot have duplicate values of the index after pivoting, you need to create a unique index. You can do this with groupby()
and cumcount()
.
long_dataframe['count'] = long_dataframe.groupby('A').cumcount()
wide_dataframe = long_dataframe.pivot(index=['Date', 'count'], columns='A', values='Values') \
.reset_index() \
.drop('count', axis=1)
Output:
A Date category_X category_Y
0 2020-01-01 30 25
1 2020-01-01 40 30
2 2020-01-02 20 42
3 2020-01-02 30 54
4 2020-01-03 40 21
5 2020-01-04 50 23
6 2020-01-04 60 30