Python Sorting and performing operations on groups pandas columns that follow a specific pattern-CodePudding

I need advanced python pandas help for my problem. I have such a dataframe which contains 4 string columns that identify the hierarchy of the item and the rest are columns that follow the following pattern:

Text | Period (example: "ACTUAL | 2020-08-01")

Text can be either: Actual, Forecast t-1 or Forecast t-3. Period can be any month-year: 2020-08-01 .... For example: "ACTUAL | 2020-08-01" "FORECAST T-1 | 2020-08-01" "FORECAST T-3 | 2020-08-01"

df.columns

show as:

['HIERARCHY 1', 'HIERARCHY 2', 'HIERARCHY 3',
   'ITEM NUMBER', 'ACTUAL|2020-06-01', 'ACTUAL|2020-07-01',
   'ACTUAL|2020-08-01', 'ACTUAL|2020-09-01', 'ACTUAL|2020-10-01',
   'ACTUAL|2020-11-01', 'ACTUAL|2020-12-01', 'ACTUAL|2021-01-01',
   'ACTUAL|2021-02-01', 'ACTUAL|2021-03-01', 'ACTUAL|2021-04-01',
   'ACTUAL|2021-05-01', 'ACTUAL|2021-06-01', 'ACTUAL|2021-07-01',
   'ACTUAL|2021-08-01', 'ACTUAL|2021-09-01', 'ACTUAL|2021-10-01',
   'FORECAST T-1|2021-01-01', 'FORECAST T-1|2021-08-01',
   'FORECAST T-1|2021-09-01', 'FORECAST T-1|2021-10-01',
   'FORECAST T-1|2021-11-01', 'FORECAST T-3|2021-03-01',
   'FORECAST T-3|2021-10-01', 'FORECAST T-3|2021-11-01',
   'FORECAST T-3|2021-12-01', 'FORECAST T-3|2022-01-01',
   'FORECAST T-1|2020-07-01', 'FORECAST T-1|2020-08-01',
   'FORECAST T-1|2020-09-01', 'FORECAST T-1|2020-10-01',
   'FORECAST T-1|2020-11-01', 'FORECAST T-1|2020-12-01',
   'FORECAST T-1|2021-02-01', 'FORECAST T-1|2021-03-01',
   'FORECAST T-1|2021-04-01', 'FORECAST T-1|2021-05-01',
   'FORECAST T-1|2021-06-01', 'FORECAST T-3|2020-09-01',
   'FORECAST T-3|2020-10-01', 'FORECAST T-3|2020-11-01',
   'FORECAST T-3|2020-12-01', 'FORECAST T-3|2021-01-01',
   'FORECAST T-3|2021-02-01', 'FORECAST T-3|2021-04-01',
   'FORECAST T-3|2021-05-01', 'FORECAST T-3|2021-06-01',
   'FORECAST T-3|2021-07-01', 'FORECAST T-3|2021-08-01'],

For any period we can have either: Actual, Forecast t-1 or Forecast t-3. There is no guarantee that there are the three of them present. Maybe only one of them, for example for specific period we can have only: "ACTUAL | 2020-08-01", "FORECAST T-1 | 2020-08-01", but there is no column showing: "FORECAST T-3 | 2020-08-01".

My goal is to achieve the following dataframe:

First step 1:

Re-order the column names as follows:

df.columns

It should be ordered by: Actual, Forecast t-1 and lastly Forecast t-3. Within each category, it should be from oldest period to most recent (left to right).

Step 2:

I want to perform operations for columns that belong to the same period. I want to loop through each of the periods within the column names, and for each period, I want to see if there is column: "ACTUAL | Any Period (ex. 2020-08-01)" and Forecast metrics for that periods, if there are: then I want to compute the absolute error, which would be computed as:

df["ABS ERROR T-1 | 2020-08-01"] = (df["ACTUAL | 2020-08-01"] - df["FORECAST T-1 | 2020-08-01"]).abs()

df["ABS ERROR T-3 | 2020-08-01"] = (df["ACTUAL | 2020-08-01"] - df["FORECAST T-3 | 2020-08-01"]).abs()

If there is no ACTUAL in the current period, then skip. If there is no FORECAST (either T-1 or T-3), then skip the period as well.

If there is at least ACTUAL and 1 Forecast period then compute the Absolute Error.

Then, put it on the dataframe at the end as a new column. I have more than 50 columns. I cannot do this manually so I need to find a pattern for my problem.

CodePudding user response：

Your column names having dates in them makes your dataframe hard to read. I would suggest restructuring your data in order to have dates as index.

Assuming you start with the following dataframe:

  HIERARCHY 1 HIERARCHY 2 HIERARCHY 3 ITEM NUMBER  ACTUAL|2020-06-01  ...  FORECAST T-3|2021-04-01  FORECAST T-3|2021-05-01  FORECAST T-3|2021-06-01  FORECAST T-3|2021-07-01  FORECAST T-3|2021-08-01
0     Level 1     Group A  Category 1      Item 1                 30  ...                       64                       91                      100                       34                       87
1     Level 1     Group B  Category 1      Item 2                 88  ...                       20                      118                       22                       97                       90
2     Level 2     Group F  Category 1      Item 2                 91  ...                       88                       46                      116                       50                       59

I would first split the column names and add date as index. This way you create several dataframes for each ['HIERARCHY 1', 'HIERARCHY 2', 'HIERARCHY 3','ITEM NUMBER', 'date'] combination. You can then concatenate those dataframes:

result_df = []

for col in df.columns:
    if '|' in col:
        name, date = col.split('|')
        df_res = df[['HIERARCHY 1', 'HIERARCHY 2', 'HIERARCHY 3','ITEM NUMBER', col]]
        df_res = df_res.rename(columns={col:name})
        df_res['date'] = date
        result_df.append(df_res)

result = pd.concat(result_df).groupby(['HIERARCHY 1', 'HIERARCHY 2', 'HIERARCHY 3','ITEM NUMBER', 'date']).mean()

You can then group and aggregate your concatenated result by ['HIERARCHY 1', 'HIERARCHY 2', 'HIERARCHY 3','ITEM NUMBER', 'date'], filter NaN values and calculate absolute error:

result = result[~result['ACTUAL'].isna()]
forecast_avail_filter = result[result.columns[1:]].dropna(how='all').index
result = result.loc[forecast_avail_filter]

for col in result.columns[1:]:
    ref = col.strip('FORECAST ')
    result['ABS ERR '   ref] = (result["ACTUAL"] - result[col]).abs()
print(result)

Output:

                                                            ACTUAL  FORECAST T-1  FORECAST T-3  ABS ERR -1  ABS ERR -3
HIERARCHY 1 HIERARCHY 2 HIERARCHY 3 ITEM NUMBER date
Level 1     Group A     Category 1  Item 1      2020-07-01   112.0         106.0           NaN         6.0         NaN
                                                2020-08-01   114.0          59.0           NaN        55.0         NaN
                                                2020-09-01    76.0         109.0          99.0        33.0        23.0
                                                2020-10-01   107.0         118.0          73.0        11.0        34.0
                                                2020-11-01    60.0          43.0          35.0        17.0        25.0
                                                2020-12-01    31.0         101.0          24.0        70.0         7.0
                                                2021-01-01    82.0          65.0          50.0        17.0        32.0
                                                2021-02-01    54.0          71.0         111.0        17.0        57.0
                                                2021-03-01    44.0         119.0          97.0        75.0        53.0
                                                2021-04-01    25.0          47.0          64.0        22.0        39.0
                                                2021-05-01    69.0         102.0          91.0        33.0        22.0
                                                2021-06-01    53.0          89.0         100.0        36.0        47.0
                                                2021-07-01   110.0           NaN          34.0         NaN        76.0
                                                2021-08-01    56.0          81.0          87.0        25.0        31.0
                                                2021-09-01    86.0          28.0           NaN        58.0         NaN
                                                2021-10-01    61.0          23.0          79.0        38.0        18.0
            Group B     Category 1  Item 2      2020-07-01    96.0          99.0           NaN         3.0         NaN
                                                2020-08-01    35.0          24.0           NaN        11.0         NaN
                                                2020-09-01   109.0         106.0          56.0         3.0        53.0
                                                2020-10-01    83.0         104.0          28.0        21.0        55.0
                                                2020-11-01    66.0          45.0          60.0        21.0         6.0
                                                2020-12-01    55.0          89.0          47.0        34.0         8.0
                                                2021-01-01    42.0          93.0          52.0        51.0        10.0
                                                2021-02-01   114.0          21.0          33.0        93.0        81.0
                                                2021-03-01    46.0          80.0          73.0        34.0        27.0
                                                2021-04-01   107.0          37.0          20.0        70.0        87.0
                                                2021-05-01   116.0          76.0         118.0        40.0         2.0
                                                2021-06-01    25.0          65.0          22.0        40.0         3.0
                                                2021-07-01    77.0           NaN          97.0         NaN        20.0
                                                2021-08-01    27.0          57.0          90.0        30.0        63.0
                                                2021-09-01   101.0         115.0           NaN        14.0         NaN
                                                2021-10-01    75.0          43.0          48.0        32.0        27.0
Level 2     Group F     Category 1  Item 2      2020-07-01   107.0          62.0           NaN        45.0         NaN
                                                2020-08-01    60.0          27.0           NaN        33.0         NaN
                                                2020-09-01    89.0          22.0          81.0        67.0         8.0
                                                2020-10-01    48.0          25.0         118.0        23.0        70.0
                                                2020-11-01   109.0          87.0          24.0        22.0        85.0
                                                2020-12-01    44.0          21.0          86.0        23.0        42.0
                                                2021-01-01    96.0          30.0          75.0        66.0        21.0
                                                2021-02-01    26.0          65.0          56.0        39.0        30.0
                                                2021-03-01    41.0          93.0         105.0        52.0        64.0
                                                2021-04-01    32.0          53.0          88.0        21.0        56.0
                                                2021-05-01    28.0          38.0          46.0        10.0        18.0
                                                2021-06-01    78.0          67.0         116.0        11.0        38.0
                                                2021-07-01    99.0           NaN          50.0         NaN        49.0
                                                2021-08-01    71.0         114.0          59.0        43.0        12.0
                                                2021-09-01    61.0          82.0           NaN        21.0         NaN
                                                2021-10-01    40.0          36.0          90.0         4.0        50.0