Is there a way to make a DataFrame from a dictionary of arrays (lists of lists)?

I am trying to build a rolling-window time series of Principal Component Analysis (PCA) results on stock returns for a backtest. I want to see whether setting the weights of the respective assets over time performs better (stronger returns) than a buy-and-hold portfolio. The problem is that it is quite difficult to build a time series of the component weights that correspond to each PCA. I came up with a partial fix but cannot seem to build a time series from this data. I am also struggling to replace the keys in the dictionary with the datetime index. I have looked around and tried most of the suggestions on Stack Overflow, but to no avail.

The below code is what I have come up with:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
np.random.seed(42)

values_for_df = []
for i in range(1,6):    
    random_numbers = np.random.random(size=60)
    values_for_df.append(random_numbers)

df = pd.DataFrame(values_for_df).T

weights = {}
dates_1 = {}
for i in range(1, len(df)):
    pca = PCA()
    transf = pca.fit_transform(df.iloc[i:i+2])
    weights[i] = pca.components_
    dates_1[i] = df.iloc[i].name

The output is a dictionary of 2D arrays (lists of lists). As indicated, I am having a hard time turning this into a df using either pd.DataFrame() or pd.concat().

Is there any way to turn this output into a dataframe where the two rows of PCA component weights correspond to a datetime?

The output of this code looks like this:

{1: array([[ 0.50938649,  0.1163777 , -0.56213712, -0.5999693 , -0.2258768 ],
        [-0.19623229,  0.68084356,  0.17894347,  0.05397575, -0.68044896]]),
 2: array([[ 0.76101188, -0.39708989, -0.35225525, -0.01460473, -0.37267074],
        [ 0.26603362, -0.44559758,  0.81939468,  0.00324688,  0.24341472]]),
 3: array([[ 0.43735771,  0.07284643, -0.23807945,  0.46456192, -0.72863711],
        [-0.84990214, -0.03839851,  0.14762177,  0.40008466, -0.30713514]]),
 4: array([[-0.10002177, -0.12908589,  0.09697811, -0.54718954,  0.81517565],
        [ 0.9291778 ,  0.24487735, -0.15042811, -0.23197187,  0.01497085]]),
 5: array([[ 0.43260558, -0.17245194, -0.15363331,  0.64845393, -0.58225171],
        [-0.8998753 , -0.03170306, -0.04644508,  0.3319358 , -0.27727395]]),
 6: array([[-0.66851419,  0.31545065, -0.26741055, -0.54749379,  0.28691779],
        [ 0.3598592 , -0.05698951, -0.01176088,  0.02137245,  0.93094492]]),
 7: array([[ 0.69949617, -0.46121291,  0.26456096,  0.47439289,  0.05428297],
        [ 0.0671515 , -0.02046416,  0.07459749, -0.04681467, -0.99363751]]),
 8: array([[ 0.76526418, -0.23880119, -0.57563869, -0.12170626, -0.10569961],
        [-0.20948119, -0.96814145,  0.11706768, -0.02831197,  0.06567612]]),
 9: array([[ 0.88308511, -0.18178186,  0.23418943,  0.05558346,  0.35941875],
        [ 0.3864688 , -0.20776523, -0.30713553, -0.12458004, -0.83523832]]),
 10: array([[ 0.02145911,  0.17212618, -0.34312327, -0.91962789,  0.08039307],
        [-0.93784872,  0.14547558,  0.22919403, -0.09705987, -0.19319965]]),
 11: array([[-0.28946201, -0.26603042,  0.62500451, -0.66932375,  0.08255082],
        [-0.79432192, -0.0826848 ,  0.20253363,  0.56666393,  0.0093821 ]]),
 12: array([[ 0.4225668 ,  0.63454067, -0.52748616,  0.37344672, -0.0330355 ],
        [ 0.89717194, -0.19965603,  0.28582373, -0.27012438,  0.02361333]]),
 13: array([[-0.09152907,  0.18236668, -0.43896889,  0.65056049, -0.5851856 ],
        [ 0.19225542,  0.02507023, -0.12112356, -0.68443942, -0.69230131]]),
 14: array([[ 0.52763656,  0.65909855, -0.10621454, -0.26420703,  0.45398444],
        [-0.20903038, -0.39874697,  0.03275961,  0.0985442 ,  0.88686132]]),
 15: array([[-0.6376942 , -0.65434659,  0.23591625, -0.20141987,  0.26258372],
        [-0.26207514, -0.31149866, -0.55752568,  0.44580663, -0.56983048]]),
 16: array([[ 0.27907902,  0.33000177, -0.37818218, -0.21758258, -0.78920833],
        [ 0.49977863, -0.51086522,  0.39388011, -0.57407273, -0.06735727]]),
 17: array([[-0.07747888, -0.44363775,  0.72389959,  0.51430407,  0.09296923],
        [-0.44632809, -0.37360701, -0.37433274,  0.2591633 , -0.67373468]]),
 18: array([[-0.24853706, -0.28143494, -0.09349904,  0.91280228,  0.13066607],
        [-0.90863048, -0.25882281,  0.08144532, -0.30767452, -0.07813102]]),
 19: array([[-0.0499767 , -0.46808766,  0.81593976,  0.32495903,  0.08390597],
        [-0.18009682,  0.19879004, -0.0864013 ,  0.65630871, -0.69988667]]),
 20: array([[-0.15978936,  0.40505628,  0.23403331,  0.27166524,  0.82572585],
        [-0.82190218, -0.11639043, -0.04382051, -0.54840546,  0.09089168]]),
 21: array([[-0.59793074,  0.36403396,  0.28523106, -0.56614702,  0.32875356],
        [ 0.08787018, -0.09763207,  0.94862929,  0.27707493, -0.07796638]]),
 22: array([[-0.04762231, -0.48706884,  0.45248363,  0.37215567, -0.64595262],
        [ 0.44614193,  0.47456984,  0.55381454, -0.44821569, -0.26102299]]),
 23: array([[-6.14200977e-01, -8.29742681e-02,  1.70228332e-01,
         -7.64025699e-01,  5.62092342e-02],
        [ 9.85466281e-02, -7.29776513e-01, -4.55585357e-04,
         -4.97073250e-02, -6.74717553e-01]]),

When attempting to create a df, I get this:

    weights_keys    weights_values
0   1   [[0.5093864920875057, 0.11637769781544054, -0....
1   2   [[0.7610118804227364, -0.3970898897595845, -0....
2   3   [[0.43735770537072516, 0.07284642654346118, -0...
3   4   [[-0.100021766544103, -0.12908589345836016, 0....
4   5   [[0.43260557607788175, -0.17245193633756645, -...
5   6   [[-0.6685141891902584, 0.3154506469430627, -0....
6   7   [[0.6994961703309339, -0.4612129082876791, 0.2...
7   8   [[0.7652641817892236, -0.23880119387494167, -0...
8   9   [[0.8830851102283364, -0.18178185688401122, 0....
9   10  [[0.02145910731659373, 0.17212617677552292, -0...
10  11  [[-0.28946201366547714, -0.2660304245115253, 0...
11  12  [[0.42256679812505826, 0.6345406677421921, -0....
12  13  [[-0.09152906655393278, 0.1823666758882022, -0...
13  14  [[0.5276365649456491, 0.6590985509896493, -0.1...
14  15  [[-0.6376941956390323, -0.6543465915749572, 0....
15  16  [[0.27907901752772, 0.33000177354673366, -0.37...
16  17  [[-0.07747887772273652, -0.44363774912889514, ...

An example of what the dataframe should look like is this:

                                              USDJPY         EURUSD         GBPUSD         AUDUSD         GBPAUD
20210924 21:00:00 Component weights 1   1.618764e-09  -5.137869e-10  -7.915763e-10  -6.841845e-10   4.352906e-10
                  Component weights 2  -5.137869e-10   1.900899e-09   9.721030e-10   1.872090e-09  -4.564939e-10
                  Component weights 3  -7.915763e-10   9.721030e-10   3.363203e-09   3.988530e-09   9.450517e-10
                  Component weights 4  -6.841845e-10   1.872090e-09   3.988530e-09   1.277432e-08  -2.272119e-09
                  Component weights 5   4.352906e-10  -4.564939e-10   9.450517e-10  -2.272119e-09   7.960307e-09
...               ...                            ...            ...            ...            ...            ...
20210924 21:59:00 Component weights 1   1.618764e-09  -5.137869e-10  -7.915763e-10  -6.841845e-10   4.352906e-10
                  Component weights 2  -5.137869e-10   1.900899e-09   9.721030e-10   1.872090e-09  -4.564939e-10
                  Component weights 3  -7.915763e-10   9.721030e-10   3.363203e-09   3.988530e-09   9.450517e-10
                  Component weights 4  -6.841845e-10   1.872090e-09   3.988530e-09   1.277432e-08  -2.272119e-09
                  Component weights 5   4.352906e-10  -4.564939e-10   9.450517e-10  -2.272119e-09   7.960307e-09

The above df is an example of a PCA created with n_components = 5.

CodePudding user response：

It is not clear what the final output should look like, so I am taking a guess.

weights = {}
dates_1 = {}
for i in range(1, len(df)):
    pca = PCA()
    transf = pca.fit_transform(df.iloc[i:i+2])
    weights[i] = pca.components_.tolist()
    dates_1[i] = df.iloc[i].name

df1 = pd.DataFrame(dates_1.items(), columns=['dates_keys', 'dates_values'])
df2 = pd.DataFrame(weights.items(), columns=['weights_keys', 'weights_values'])

df = df1.merge(df2, left_on='dates_keys', right_on='weights_keys')
df[['pca1', 'pca2']] = pd.DataFrame(df['weights_values'].tolist())
df.drop('weights_values', axis=1, inplace=True)
print(df.head(2))

Does this solve your problem?
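
As an alternative sketch, a dictionary of 2D arrays can also be stacked directly with pd.concat, which uses the dictionary keys as the outer index level. The `weights` dict below is filled with random stand-in values instead of real PCA output, and the `dates` index is hypothetical; in the real case it would come from the returns data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `weights` dict: integer keys mapping to 2 x 5 arrays.
rng = np.random.default_rng(0)
weights = {i: rng.random((2, 5)) for i in range(1, 4)}

# Hypothetical timestamps, one per dictionary key (replace with the real index).
dates = pd.date_range("2021-09-24 21:00", periods=len(weights), freq="min")

# Wrap each array in a small DataFrame and key it by its timestamp.
frames = {
    date: pd.DataFrame(
        arr,
        index=[f"Component weights {k + 1}" for k in range(arr.shape[0])],
    )
    for date, arr in zip(dates, weights.values())
}

# Concatenating a dict uses its keys as the outer level of a MultiIndex.
stacked = pd.concat(frames, names=["date", "component"])
print(stacked.shape)  # (6, 5)
```

This relies on the dictionary preserving insertion order, so each array is paired with the timestamp at the same position.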

CodePudding user response：

Following @HoneyBeer's response above, a df can be created as below:

df3 = []
for i in range(0, len(weights)):
    new_df = pd.DataFrame(df['weights_values'][i].tolist())
    df3.append(new_df)

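# `returns` is presumably the original returns DataFrame with a datetime index (not shown here)
final_df = pd.concat(df3, keys=returns.index).rename(
    index={0: 'Component weights 1', 1: 'Component weights 2'},
    columns={0: 'USDJPY', 1: 'EURUSD', 2: 'GBPUSD', 3: 'AUDUSD', 4: 'GBPAUD'},
)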

The result is this:

                                          USDJPY     EURUSD     GBPUSD     AUDUSD     GBPAUD
Date
20210924 21:00:00 Component weights 1  -0.138952  -0.149062   0.547648  -0.264848   0.767079
                  Component weights 2  -0.934455   0.048407   0.125520  -0.140824  -0.298100
20210924 21:01:00 Component weights 1   0.149391   0.255187  -0.094000  -0.653122   0.690766
                  Component weights 2   0.427402  -0.215456   0.255242  -0.621257  -0.565506
20210924 21:02:00 Component weights 1  -0.214539   0.192370  -0.134088   0.146269  -0.936799
...               ...                        ...        ...        ...        ...        ...
20210924 21:56:00 Component weights 1   0.002072   0.409711  -0.598962  -0.486351  -0.486662
                  Component weights 2  -0.079410   0.416419  -0.490364   0.726674   0.227546
20210924 21:57:00 Component weights 1  -0.287978  -0.138368   0.623330   0.679409   0.218598
                  Component weights 2   0.060904   0.070058   0.550906  -0.206938  -0.803157
20210924 21:58:00 Component weights 1   1.000000   0.000000   0.000000   0.000000   0.000000

CodePudding user response：

Here is what I think you are trying to do. You have a timeseries consisting of 60 sampling intervals. For the purposes of this answer, I will assume the interval is 1 day, so you have 60 days in the timeseries. I also think you have 5 data columns for the timeseries. So your input is something like

date	var1	var2	var3	var4	var5
2018-04-24	1.1	2.2	2.5	3.5	3.3
2018-04-25	1.0	2.3	3.9	8.7	2.7
2018-04-26	0.9	2.7	4.0	6.5	4.6

You are then calculating principal components for a sliding 2-day window. You want to combine all of these PCA results into a single data frame.

To combine all of the PCA results, you can use a MultiIndex.

Here is a full working example.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
rng = np.random.default_rng(42)

values_for_df = []
n_dates = 60
window_length = 2
n_columns = 5
n_components = 5
for i in range(n_columns):
    random_numbers = rng.random(size=n_dates)
    values_for_df.append(random_numbers)

df = pd.DataFrame(values_for_df).T
dates = pd.date_range(start="2018-04-24", periods=n_dates-1)
pca_component_labels = [f"weights_{i+1}" for i in range(window_length)]
my_index = pd.MultiIndex.from_product(
    (dates, pca_component_labels),
    names=["date", "pca_component"]
)
weights = []
for i in range(n_dates - 1):
    pca = PCA()
    transf = pca.fit_transform(df.iloc[i:i+2])
    weights.append(pca.components_)

timeseries_pca = pd.DataFrame(
    np.concatenate(weights),
    index=my_index,
    columns=[f"PCA{i+1}" for i in range(n_components)]
)
timeseries_pca

This is the result.

                              PCA1      PCA2      PCA3      PCA4      PCA5
date       pca_component
2018-04-24 weights_1     -0.846738 -0.498592  0.166146  0.006707 -0.082409
           weights_2     -0.530398  0.817748 -0.206418 -0.009032  0.085296
2018-04-25 weights_1      0.427636  0.095916 -0.576067 -0.381654 -0.574818
           weights_2      0.875506 -0.012782  0.117490  0.117209  0.453634
2018-04-26 weights_1      0.291577 -0.361262 -0.599255  0.366988 -0.539153
...                            ...       ...       ...       ...       ...
2018-06-19 weights_2      0.468190  0.773792 -0.259101  0.301738 -0.154483
2018-06-20 weights_1      0.461097  0.308911  0.108246  0.824414  0.024239
           weights_2      0.172187  0.429880  0.585669 -0.317087 -0.584809
2018-06-21 weights_1     -0.085999  0.664442 -0.486954  0.310776 -0.466278
           weights_2     -0.128612  0.040888  0.719771 -0.016595 -0.680765

[118 rows x 5 columns]
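
Once the weights sit in a frame indexed by (date, pca_component), individual slices can be pulled out with standard MultiIndex selection. Here is a minimal sketch, using a small hypothetical frame `demo` with made-up values in place of real PCA output:

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with the same (date, pca_component) index layout.
dates = pd.date_range(start="2018-04-24", periods=3)
labels = ["weights_1", "weights_2"]
index = pd.MultiIndex.from_product(
    (dates, labels), names=["date", "pca_component"]
)
demo = pd.DataFrame(
    np.arange(30, dtype=float).reshape(6, 5),  # made-up values, not PCA output
    index=index,
    columns=[f"PCA{i + 1}" for i in range(5)],
)

# Both component rows for a single date.
one_day = demo.loc[pd.Timestamp("2018-04-25")]

# The first component across all dates, indexed by date only.
first_component = demo.xs("weights_1", level="pca_component")

# Wide layout: one row per date, one column per (column, component) pair.
wide = demo.unstack("pca_component")
```

The wide layout may be the more convenient one for a backtest, since each row then lines up with one row of the returns series.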