Home > Back-end >  Having the same index values when pivoting a dataframe from long to wide format gives an average val
Having the same index values when pivoting a dataframe from long to wide format gives an average val

Time:02-03

Context: I'm trying to pivot a long format dataframe to a wide format dataframe, however, I'm noticing a weird pattern on the wide format dataframe. It seems that if we have repeated values for the index (in my case, a date), it's almost like it's giving me an average instead of repeating each index value and keeping the original values?

Here's a minimal reproducible example:

    import datetime
    import pandas as pd
long_dataframe = pd.DataFrame({"Date": [
    datetime.datetime.strptime("01-01-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-01-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-02-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-02-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-03-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-04-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-04-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-01-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-01-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-02-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-02-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-03-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-04-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-04-2020", '%m-%d-%Y').date()
], "A": [
    "category_X", "category_X", "category_X", "category_X", "category_X", "category_X", "category_X",
    "category_Y", "category_Y", "category_Y", "category_Y", "category_Y", "category_Y", "category_Y"], "Values": [30, 40, 20, 30, 40, 50, 60,25,30,42,54,21,23,30]})

wide_dataframe = long_dataframe.reset_index().pivot_table(
    index="Date", columns="A", values="Values")

wide_dataframe

Which gives me this:

A           category_X  category_Y
Date
2020-01-01        35.0        27.5
2020-01-02        25.0        48.0
2020-01-03        40.0        21.0
2020-01-04        55.0        26.5

How can I make it so that I see the repeated dates with their original values? Why is it that for 2020-01-01 its giving the value in between this date (30 and 40)?

Desired output would look something like this:

A           category_X    category_Y
Date
2020-01-01          30       ...
2020-01-01          40
2020-01-02          20
2020-01-02          30
2020-01-03          40
2020-01-04          50
2020-01-04          60

How can I do this while keeping duplicated indices?

I was thinking of giving each row a unique ID, but I'd really like to do this directly using the dates if possible (without creting any additional IDs)

Thank you!

CodePudding user response:

pivot_table automatically aggregates. Since you cannot have duplicate values of the index after pivoting, you need to create a unique index. You can do this with groupby() and cumcount().

long_dataframe['count'] = long_dataframe.groupby('A').cumcount()

wide_dataframe = long_dataframe.pivot(index=['Date', 'count'], columns='A', values='Values') \
                               .reset_index() \
                               .drop('count', axis=1)

Output:

A        Date  category_X  category_Y
0  2020-01-01          30          25
1  2020-01-01          40          30
2  2020-01-02          20          42
3  2020-01-02          30          54
4  2020-01-03          40          21
5  2020-01-04          50          23
6  2020-01-04          60          30
  • Related