Home > Blockchain >  pandas stack columns from column name, forecasting task
pandas stack columns from column name, forecasting task

Time:11-12

I am working with forecast data.

I want to create a new column (horizon) from the current column names, and stack the values from the two variables (["y", "yhat"])

import pandas as pd 
import numpy as np 
import itertools

rng = np.random.default_rng(2021)
index = list(itertools.product(
    ["pixel_1", "pixel_2"],
    pd.date_range("2001-01-01", "2002-01-01", freq="D")
))

dat = pd.DataFrame({
    "initialisation_time": np.array(index)[:, 1],
    "pixel": np.array(index)[:, 0],
    "y_0": rng.random(len(index)),
    "y_1": rng.random(len(index)),
    "yhat_0": rng.random(len(index)),
    "yhat_1": rng.random(len(index)),
})

Gives me a dataframe as below:

          time    pixel       y_0       y_1    yhat_0    yhat_1
0   2001-01-01  pixel_1  0.257135  0.609062  0.989102  0.297949
1   2001-01-02  pixel_1  0.898808  0.509193  0.966388  0.182610
2   2001-01-03  pixel_1  0.724221  0.537482  0.455078  0.141172
3   2001-01-04  pixel_1  0.503014  0.391576  0.382041  0.652105
4   2001-01-05  pixel_1  0.688625  0.865679  0.828888  0.856478
..         ...      ...       ...       ...       ...       ...
727 2001-12-28  pixel_2  0.697661  0.725550  0.926735  0.527801
728 2001-12-29  pixel_2  0.052295  0.632843  0.536919  0.817767
729 2001-12-30  pixel_2  0.306129  0.426934  0.638589  0.697375
730 2001-12-31  pixel_2  0.633789  0.982248  0.255824  0.982358
731 2002-01-01  pixel_2  0.922154  0.088203  0.887233  0.700154

[732 rows x 6 columns]

## I want a dataframe in the following form: converting the four columns (["y_0", "y_1", "yhat_0", "yhat_1"]) into three (["horizon", "y", "yhat"]).

target_lookalike = pd.DataFrame({
    "initialisation_time": np.tile(np.array(index)[:, 1], 2),
    "pixel": np.tile(np.array(index)[:, 0], 2),
    "horizon": np.tile([0, 1], len(index)),
    "y": rng.random(len(index) * 2),
    "y_hat": rng.random(len(index) * 2),
})
     initialisation_time    pixel  horizon         y     y_hat
0             2001-01-01  pixel_1        0  0.833400  0.457637
1             2001-01-02  pixel_1        1  0.607682  0.302057
2             2001-01-03  pixel_1        0  0.474058  0.045079
3             2001-01-04  pixel_1        1  0.198236  0.586153
4             2001-01-05  pixel_1        0  0.047407  0.179364
...                  ...      ...      ...       ...       ...
1459          2001-12-28  pixel_2        1  0.157809  0.997542
1460          2001-12-29  pixel_2        0  0.659860  0.830327
1461          2001-12-30  pixel_2        1  0.047798  0.541301
1462          2001-12-31  pixel_2        0  0.002146  0.945190
1463          2002-01-01  pixel_2        1  0.636621  0.735699

[1464 rows x 5 columns]

Where the horizon is taken from the column string ("y_0" would be horizon == 0), and the values are assigned to either y or y_hat.

CodePudding user response:

Perhaps someone can find a neater way to achieve the same thing, but this should work:

# df0 is original data

df1 = df0[['time','pixel','y_0','yhat_0']].rename(columns={'y_0':'y','yhat_0':'yhat'})
df1['Horizon'] = 0

df2 = df0[['time','pixel','y_1','yhat_1']].rename(columns={'y_1':'y','yhat_1':'yhat'})
df2['Horizon'] = 1

df3 = df1.append(df2).sort_values('time')

I get the following output from df3:

        time    pixel         y      yhat  Horizon
0 2001-01-01  pixel_1  0.257135  0.989102        0
0 2001-01-01  pixel_1  0.609062  0.297949        1
1 2001-01-02  pixel_1  0.898808  0.966388        0
1 2001-01-02  pixel_1  0.509193  0.182610        1
2 2001-01-03  pixel_1  0.724221  0.455078        0
2 2001-01-03  pixel_1  0.537482  0.141172        1
3 2001-01-04  pixel_1  0.503014  0.382041        0
3 2001-01-04  pixel_1  0.391576  0.652105        1
4 2001-01-05  pixel_1  0.688625  0.828888        0
4 2001-01-05  pixel_1  0.865679  0.856478        1
5 2001-12-28  pixel_2  0.697661  0.926735        0
5 2001-12-28  pixel_2  0.725550  0.527801        1
6 2001-12-29  pixel_2  0.052295  0.536919        0
6 2001-12-29  pixel_2  0.632843  0.817767        1
7 2001-12-30  pixel_2  0.306129  0.638589        0
7 2001-12-30  pixel_2  0.426934  0.697375        1
8 2001-12-31  pixel_2  0.633789  0.255824        0
8 2001-12-31  pixel_2  0.982248  0.982358        1

It's slightly different from your output in that each date is displayed twice, but I think that's probably the desired output?

CodePudding user response:

The following function works:

def unstack_preds_wide_to_long(
    preds: pd.DataFrame
) -> pd.DataFrame:    
    # stack into LONG form
    df = preds.set_index(["pixel", "initialisation_time"]).stack().reset_index().rename({0: "value"}, axis=1)
    
    # explode "level_2" column into multiple columns (split on "_")
    df = df.join(df["level_2"].str.split("_", 1, expand=True).rename({0: "variable", 1: "horizon"}, axis=1)).drop(columns=["level_2"])
    
    # change column with variables to columns
    df = df.set_index(["pixel", "initialisation_time", "horizon"]).pivot(columns="variable")
    
    # drop the extra multi-index level in the columns
    df.columns = df.columns.droplevel()

    return df.reset_index()

and call it as so:

df = unstack_preds_wide_to_long(dat)

df
variable    pixel initialisation_time horizon         y      yhat
0         pixel_1          2001-01-01       0  0.756948  0.141300
1         pixel_1          2001-01-01       1  0.006226  0.654879
2         pixel_1          2001-01-02       0  0.941382  0.461212
3         pixel_1          2001-01-02       1  0.330925  0.278326
4         pixel_1          2001-01-03       0  0.592463  0.498596
...           ...                 ...     ...       ...       ...
1459      pixel_2          2001-12-30       1  0.092664  0.320106
1460      pixel_2          2001-12-31       0  0.076966  0.118248
1461      pixel_2          2001-12-31       1  0.733759  0.116228
1462      pixel_2          2002-01-01       0  0.287662  0.546546
1463      pixel_2          2002-01-01       1  0.635012  0.697661

[1464 rows x 5 columns]
  • Related