Home > Net >  advanced dataframe reshaping in pandas
advanced dataframe reshaping in pandas

Time:04-01

I am trying to reshape a dataframe at monthly level without much success. I have a dataframe that contains data that spans a given period: either monthly, quarterly or yearly. Essentially I would like to reshape the dataframe as follows: once all the data available at monthly level is exhausted, use the quarterly values, then once all quarterly values are exhausted, use the yearly values. Do you have any idea how I could do that?

Many thanks for your help!

Inputs:

var_name        begin_delivery_date    end_delivery_date      value
Monthly 2022    2022-01-01T06:00:00    2022-02-01T05:59:59      5
Monthly 2022    2022-02-01T06:00:00    2022-03-01T05:59:59      7
   ...                 ...                  ...                ...
Quarterly 2022  2022-01-01T06:00:00    2022-04-01T06:00:00      10
   ...                 ...                  ...                ...
Yearly 2022     2022-01-01T06:00:00    2023-01-01T06:00:00     49

expected output:

  date        var_name        value
  2022-01-01   Monthly 2022     5
  2022-02-01   Monthly 2022     7
  2022-03-01  Quarterly 2022    10
  2022-04-01  Yearly 2022       49
  2022-05-01  Yearly 2022       49
  2022-06-01  Yearly 2022       49 
  2022-07-01  Yearly 2022       49 
  2022-08-01  Yearly 2022       49
  2022-09-01  Yearly 2022       49 
  2022-10-01  Yearly 2022       49
  2022-11-01  Yearly 2022       49
  2022-12-01  Yearly 2022       49

input data to play with:

 {  {
        "begin_delivery_date": "2022-01-01T06:00:00",
        "var name": "Monthly 2022",
        "end_delivery_date": "2022-02-01T05:59:59",
        "value": 5
    },
    {
        "begin_delivery_date": "2022-02-01T06:00:00",
        "var name": "Monthly 2022",
        "end_delivery_date": "2022-03-01T05:59:59",
        "value": 7
    },
    {
        "begin_delivery_date": "2022-03-01T06:00:00",
        "var name": "Monthly 2022",
        "end_delivery_date": "2022-04-01T05:59:59",
        "value": 8
    },
    {
        "begin_delivery_date": "2022-04-01T06:00:00",
        "var name": "Monthly 2022",
        "end_delivery_date": "2022-05-01T05:59:59",
        "value": 9
    },
    {
        "begin_delivery_date": "2022-04-01T06:00:00",
        "var name": "Quarterly 2022",
        "end_delivery_date": "2022-07-01T05:59:59",
        "value": 10
    },
    {
        "begin_delivery_date": "2022-07-01T06:00:00",
        "var name": "Quarterly 2022",
        "end_delivery_date": "2022-10-01T05:59:59",
        "value": 11
    },
    {
        "begin_delivery_date": "2022-09-01T06:00:00",
        "var name": "Quarterly 2022",
        "end_delivery_date": "2023-01-01T05:59:59",
        "value": 12
    },
    {
        "begin_delivery_date": "2023-01-01T06:00:00",
        "var name": "Yearly 2023",
        "end_delivery_date": "2024-01-01T05:59:59",
        "value": 50
    },
    {
        "begin_delivery_date": "2024-01-01T06:00:00",
        "var name": "Yearly 2024",
        "end_delivery_date": "2025-01-01T05:59:59",
        "value": 60
    }
 }

CodePudding user response:

IIUC,

import pandas as pd
import numpy as np

data =  [ {
        "begin_delivery_date": "2022-01-01T06:00:00",
        "var name": "Monthly 2022",
        "end_delivery_date": "2022-02-01T05:59:59",
        "value": 5
    },
    {
        "begin_delivery_date": "2022-02-01T06:00:00",
        "var name": "Monthly 2022",
        "end_delivery_date": "2022-03-01T05:59:59",
        "value": 7
    },
    {
        "begin_delivery_date": "2022-03-01T06:00:00",
        "var name": "Monthly 2022",
        "end_delivery_date": "2022-04-01T05:59:59",
        "value": 8
    },
    {
        "begin_delivery_date": "2022-04-01T06:00:00",
        "var name": "Monthly 2022",
        "end_delivery_date": "2022-05-01T05:59:59",
        "value": 9
    },
    {
        "begin_delivery_date": "2022-04-01T06:00:00",
        "var name": "Quarterly 2022",
        "end_delivery_date": "2022-07-01T05:59:59",
        "value": 10
    },
    {
        "begin_delivery_date": "2022-07-01T06:00:00",
        "var name": "Quarterly 2022",
        "end_delivery_date": "2022-10-01T05:59:59",
        "value": 11
    },
    {
        "begin_delivery_date": "2022-09-01T06:00:00",
        "var name": "Quarterly 2022",
        "end_delivery_date": "2023-01-01T05:59:59",
        "value": 12
    },
    {
        "begin_delivery_date": "2023-01-01T06:00:00",
        "var name": "Yearly 2023",
        "end_delivery_date": "2024-01-01T05:59:59",
        "value": 50
    },
    {
        "begin_delivery_date": "2024-01-01T06:00:00",
        "var name": "Yearly 2024",
        "end_delivery_date": "2025-01-01T05:59:59",
        "value": 60
    }
]

df = pd.DataFrame(data)

Create list of dates from date range and explode dataframe.

df['dates'] = [pd.date_range(s, e, freq='M') for s, e in zip(df['begin_delivery_date'], df['end_delivery_date'])]

df_out = df.explode('dates')
print(df_out)

Output:

   begin_delivery_date        var name    end_delivery_date  value               dates
0  2022-01-01T06:00:00    Monthly 2022  2022-02-01T05:59:59      5 2022-01-31 06:00:00
1  2022-02-01T06:00:00    Monthly 2022  2022-03-01T05:59:59      7 2022-02-28 06:00:00
2  2022-03-01T06:00:00    Monthly 2022  2022-04-01T05:59:59      8 2022-03-31 06:00:00
3  2022-04-01T06:00:00    Monthly 2022  2022-05-01T05:59:59      9 2022-04-30 06:00:00
4  2022-04-01T06:00:00  Quarterly 2022  2022-07-01T05:59:59     10 2022-04-30 06:00:00
4  2022-04-01T06:00:00  Quarterly 2022  2022-07-01T05:59:59     10 2022-05-31 06:00:00
4  2022-04-01T06:00:00  Quarterly 2022  2022-07-01T05:59:59     10 2022-06-30 06:00:00
5  2022-07-01T06:00:00  Quarterly 2022  2022-10-01T05:59:59     11 2022-07-31 06:00:00
5  2022-07-01T06:00:00  Quarterly 2022  2022-10-01T05:59:59     11 2022-08-31 06:00:00
5  2022-07-01T06:00:00  Quarterly 2022  2022-10-01T05:59:59     11 2022-09-30 06:00:00
6  2022-09-01T06:00:00  Quarterly 2022  2023-01-01T05:59:59     12 2022-09-30 06:00:00
6  2022-09-01T06:00:00  Quarterly 2022  2023-01-01T05:59:59     12 2022-10-31 06:00:00
6  2022-09-01T06:00:00  Quarterly 2022  2023-01-01T05:59:59     12 2022-11-30 06:00:00
6  2022-09-01T06:00:00  Quarterly 2022  2023-01-01T05:59:59     12 2022-12-31 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-01-31 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-02-28 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-03-31 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-04-30 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-05-31 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-06-30 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-07-31 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-08-31 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-09-30 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-10-31 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-11-30 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-12-31 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-01-31 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-02-29 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-03-31 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-04-30 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-05-31 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-06-30 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-07-31 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-08-31 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-09-30 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-10-31 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-11-30 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-12-31 06:00:00

CodePudding user response:

Create a df and shuffle it(data are the data you wrote above)

df = pd.DataFrame(data)
df = df.sample(frac=1).reset_index(drop=True)

Split each value from var name into 2 separate columns,var_name_pediod and var_name_year

df["var_name_pediod"] = df["var name"].str.split(" ").str[0]
df["var_name_year"] = df["var name"].str.split(" ").str[1]

Create a dictionary for sorting the periods and replace the "var_name_pediod" column with the dictionary

sort_dic = {"Monthly":1,"Quarterly":2,"Yearly":3}
df["var_name_pediod"] = df["var_name_pediod"].replace(sort_dic)

Sort values by "var_name_pediod" column

df.sort_values(by=['var_name_pediod'], inplace=True)

Groupby var_name_pediod and sort by "var_name_year

df.groupby(['var_name_pediod']).apply(lambda x: x.sort_values(by=['var_name_year'])).reset_index(drop=True)

Done. Drop extra columns if you don't need them

df.drop(columns=["var_name_pediod","var_name_year"],inplace=True)
  • Related