I am trying to build a rolling-window time series of Principal Component Analysis (PCA) weights for stock returns, to use in a backtest. I want to see whether resetting the weights of the respective assets over time performs better (stronger returns) than a buy-and-hold portfolio. The difficulty is building a time series of the component weights produced by each PCA fit. I came up with a partial fix, but I cannot turn the result into a time series. I am also struggling to replace the integer keys in the dictionary with the datetime values. I have looked around and tried most of the suggestions on Stack Overflow, but to no avail.
The code below is what I have come up with:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

np.random.seed(42)
values_for_df = []
for i in range(1, 6):
    random_numbers = np.random.random(size=60)
    values_for_df.append(random_numbers)
df = pd.DataFrame(values_for_df).T

weights = {}
dates_1 = {}
for i in range(1, len(df)):
    pca = PCA()
    # fit the PCA on a 2-row rolling window starting at row i
    transf = pca.fit_transform(df.iloc[i:i + 2])
    weights[i] = pca.components_
    dates_1[i] = df.iloc[i].name
The output is a dictionary of 2-D arrays (lists of lists). As indicated, I am having a hard time turning this into a DataFrame using either pd.DataFrame() or pd.concat().
Is there any way to turn this output into a DataFrame where the two rows of PCA component weights correspond to a datetime?
The output of this code looks like this:
{1: array([[ 0.50938649, 0.1163777 , -0.56213712, -0.5999693 , -0.2258768 ],
[-0.19623229, 0.68084356, 0.17894347, 0.05397575, -0.68044896]]),
2: array([[ 0.76101188, -0.39708989, -0.35225525, -0.01460473, -0.37267074],
[ 0.26603362, -0.44559758, 0.81939468, 0.00324688, 0.24341472]]),
3: array([[ 0.43735771, 0.07284643, -0.23807945, 0.46456192, -0.72863711],
[-0.84990214, -0.03839851, 0.14762177, 0.40008466, -0.30713514]]),
4: array([[-0.10002177, -0.12908589, 0.09697811, -0.54718954, 0.81517565],
[ 0.9291778 , 0.24487735, -0.15042811, -0.23197187, 0.01497085]]),
5: array([[ 0.43260558, -0.17245194, -0.15363331, 0.64845393, -0.58225171],
[-0.8998753 , -0.03170306, -0.04644508, 0.3319358 , -0.27727395]]),
6: array([[-0.66851419, 0.31545065, -0.26741055, -0.54749379, 0.28691779],
[ 0.3598592 , -0.05698951, -0.01176088, 0.02137245, 0.93094492]]),
7: array([[ 0.69949617, -0.46121291, 0.26456096, 0.47439289, 0.05428297],
[ 0.0671515 , -0.02046416, 0.07459749, -0.04681467, -0.99363751]]),
8: array([[ 0.76526418, -0.23880119, -0.57563869, -0.12170626, -0.10569961],
[-0.20948119, -0.96814145, 0.11706768, -0.02831197, 0.06567612]]),
9: array([[ 0.88308511, -0.18178186, 0.23418943, 0.05558346, 0.35941875],
[ 0.3864688 , -0.20776523, -0.30713553, -0.12458004, -0.83523832]]),
10: array([[ 0.02145911, 0.17212618, -0.34312327, -0.91962789, 0.08039307],
[-0.93784872, 0.14547558, 0.22919403, -0.09705987, -0.19319965]]),
11: array([[-0.28946201, -0.26603042, 0.62500451, -0.66932375, 0.08255082],
[-0.79432192, -0.0826848 , 0.20253363, 0.56666393, 0.0093821 ]]),
12: array([[ 0.4225668 , 0.63454067, -0.52748616, 0.37344672, -0.0330355 ],
[ 0.89717194, -0.19965603, 0.28582373, -0.27012438, 0.02361333]]),
13: array([[-0.09152907, 0.18236668, -0.43896889, 0.65056049, -0.5851856 ],
[ 0.19225542, 0.02507023, -0.12112356, -0.68443942, -0.69230131]]),
14: array([[ 0.52763656, 0.65909855, -0.10621454, -0.26420703, 0.45398444],
[-0.20903038, -0.39874697, 0.03275961, 0.0985442 , 0.88686132]]),
15: array([[-0.6376942 , -0.65434659, 0.23591625, -0.20141987, 0.26258372],
[-0.26207514, -0.31149866, -0.55752568, 0.44580663, -0.56983048]]),
16: array([[ 0.27907902, 0.33000177, -0.37818218, -0.21758258, -0.78920833],
[ 0.49977863, -0.51086522, 0.39388011, -0.57407273, -0.06735727]]),
17: array([[-0.07747888, -0.44363775, 0.72389959, 0.51430407, 0.09296923],
[-0.44632809, -0.37360701, -0.37433274, 0.2591633 , -0.67373468]]),
18: array([[-0.24853706, -0.28143494, -0.09349904, 0.91280228, 0.13066607],
[-0.90863048, -0.25882281, 0.08144532, -0.30767452, -0.07813102]]),
19: array([[-0.0499767 , -0.46808766, 0.81593976, 0.32495903, 0.08390597],
[-0.18009682, 0.19879004, -0.0864013 , 0.65630871, -0.69988667]]),
20: array([[-0.15978936, 0.40505628, 0.23403331, 0.27166524, 0.82572585],
[-0.82190218, -0.11639043, -0.04382051, -0.54840546, 0.09089168]]),
21: array([[-0.59793074, 0.36403396, 0.28523106, -0.56614702, 0.32875356],
[ 0.08787018, -0.09763207, 0.94862929, 0.27707493, -0.07796638]]),
22: array([[-0.04762231, -0.48706884, 0.45248363, 0.37215567, -0.64595262],
[ 0.44614193, 0.47456984, 0.55381454, -0.44821569, -0.26102299]]),
23: array([[-6.14200977e-01, -8.29742681e-02, 1.70228332e-01,
-7.64025699e-01, 5.62092342e-02],
[ 9.85466281e-02, -7.29776513e-01, -4.55585357e-04,
-4.97073250e-02, -6.74717553e-01]]),
When attempting to create a df, I get this:
weights_keys weights_values
0 1 [[0.5093864920875057, 0.11637769781544054, -0....
1 2 [[0.7610118804227364, -0.3970898897595845, -0....
2 3 [[0.43735770537072516, 0.07284642654346118, -0...
3 4 [[-0.100021766544103, -0.12908589345836016, 0....
4 5 [[0.43260557607788175, -0.17245193633756645, -...
5 6 [[-0.6685141891902584, 0.3154506469430627, -0....
6 7 [[0.6994961703309339, -0.4612129082876791, 0.2...
7 8 [[0.7652641817892236, -0.23880119387494167, -0...
8 9 [[0.8830851102283364, -0.18178185688401122, 0....
9 10 [[0.02145910731659373, 0.17212617677552292, -0...
10 11 [[-0.28946201366547714, -0.2660304245115253, 0...
11 12 [[0.42256679812505826, 0.6345406677421921, -0....
12 13 [[-0.09152906655393278, 0.1823666758882022, -0...
13 14 [[0.5276365649456491, 0.6590985509896493, -0.1...
14 15 [[-0.6376941956390323, -0.6543465915749572, 0....
15 16 [[0.27907901752772, 0.33000177354673366, -0.37...
16 17 [[-0.07747887772273652, -0.44363774912889514, ...
An example of what the dataframe should look like is this:
USDJPY EURUSD GBPUSD AUDUSD GBPAUD
20210924 21:00:00 Component weights 1 1.618764e-09 -5.137869e-10 -7.915763e-10 -6.841845e-10 4.352906e-10
Component weights 2 -5.137869e-10 1.900899e-09 9.721030e-10 1.872090e-09 -4.564939e-10
Component weights 3 -7.915763e-10 9.721030e-10 3.363203e-09 3.988530e-09 9.450517e-10
Component weights 4 -6.841845e-10 1.872090e-09 3.988530e-09 1.277432e-08 -2.272119e-09
Component weights 5 4.352906e-10 -4.564939e-10 9.450517e-10 -2.272119e-09 7.960307e-09
... ... ... ... ... ... ...
20210924 21:59:00 Component weights 1 1.618764e-09 -5.137869e-10 -7.915763e-10 -6.841845e-10 4.352906e-10
Component weights 2 -5.137869e-10 1.900899e-09 9.721030e-10 1.872090e-09 -4.564939e-10
Component weights 3 -7.915763e-10 9.721030e-10 3.363203e-09 3.988530e-09 9.450517e-10
Component weights 4 -6.841845e-10 1.872090e-09 3.988530e-09 1.277432e-08 -2.272119e-09
Component weights 5 4.352906e-10 -4.564939e-10 9.450517e-10 -2.272119e-09 7.960307e-09
The above df is an example of a PCA created with n_components = 5
CodePudding user response:
It is not clear what the final output should look like, so I am taking a guess.
weights = {}
dates_1 = {}
for i in range(1, len(df)):
    pca = PCA()
    transf = pca.fit_transform(df.iloc[i:i + 2])
    # store plain nested lists instead of a NumPy array
    weights[i] = pca.components_.tolist()
    dates_1[i] = df.iloc[i].name
df1 = pd.DataFrame(dates_1.items(), columns=['dates_keys', 'dates_values'])
df2 = pd.DataFrame(weights.items(), columns=['weights_keys', 'weights_values'])
df = df1.merge(df2, left_on='dates_keys', right_on='weights_keys')
# one column per principal component; each cell holds that component's list of weights
df[['pca1', 'pca2']] = pd.DataFrame(df['weights_values'].tolist())
df.drop('weights_values', axis=1, inplace=True)
print(df.head(2))
Does this solve your problem?
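If the goal is instead one row per component under each timestamp, a possible alternative is to pass a dict of small frames to pd.concat, which turns the dict keys into the outer index level. The sketch below is only an illustration under stated assumptions: the random arrays stand in for pca.components_, and the timestamps and asset names are hypothetical values borrowed from the question's example output.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
asset_names = ["USDJPY", "EURUSD", "GBPUSD", "AUDUSD", "GBPAUD"]

# Stand-in for the question's `weights` dict: integer key -> (2, 5) array,
# i.e. the shape of pca.components_ for a 2-row window over 5 columns.
weights = {i: rng.normal(size=(2, 5)) for i in range(1, 4)}

# Hypothetical timestamps, one per window; real code would take them from
# the index of the returns DataFrame.
dates = pd.date_range("2021-09-24 21:00", periods=len(weights), freq="min")
key_to_date = dict(zip(weights, dates))

# One small frame per window, keyed by its timestamp instead of the integer.
frames = {
    key_to_date[key]: pd.DataFrame(
        arr,
        index=[f"Component weights {j + 1}" for j in range(arr.shape[0])],
        columns=asset_names,
    )
    for key, arr in weights.items()
}

# The dict keys become the outer level of a two-level MultiIndex.
weights_df = pd.concat(frames, names=["date", "component"])
print(weights_df)
```

Building the row labels from arr.shape[0] means a window that yields fewer components still gets correctly labelled rows.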
CodePudding user response:
Following @HoneyBeer's response above, a df can be created as below:
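df3 = []
for i in range(0, len(weights)):
    # run this before 'weights_values' is dropped; works for lists and arrays alike
    new_df = pd.DataFrame(df['weights_values'][i])
    df3.append(new_df)
# `returns` is not defined in this snippet; presumably it is the returns
# DataFrame whose index holds the dates
final_df = pd.concat(df3, keys=returns.index).rename(
    index={0: 'Component weights 1', 1: 'Component weights 2'},
    columns={0: 'USDJPY', 1: 'EURUSD', 2: 'GBPUSD', 3: 'AUDUSD', 4: 'GBPAUD'},
)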
The result is this:
USDJPY EURUSD GBPUSD AUDUSD GBPAUD
Date
20210924 21:00:00 Component weights 1 -0.138952 -0.149062 0.547648 -0.264848 0.767079
Component weights 2 -0.934455 0.048407 0.125520 -0.140824 -0.298100
20210924 21:01:00 Component weights 1 0.149391 0.255187 -0.094000 -0.653122 0.690766
Component weights 2 0.427402 -0.215456 0.255242 -0.621257 -0.565506
20210924 21:02:00 Component weights 1 -0.214539 0.192370 -0.134088 0.146269 -0.936799
... ... ... ... ... ... ...
20210924 21:56:00 Component weights 1 0.002072 0.409711 -0.598962 -0.486351 -0.486662
Component weights 2 -0.079410 0.416419 -0.490364 0.726674 0.227546
20210924 21:57:00 Component weights 1 -0.287978 -0.138368 0.623330 0.679409 0.218598
Component weights 2 0.060904 0.070058 0.550906 -0.206938 -0.803157
20210924 21:58:00 Component weights 1 1.000000 0.000000 0.000000 0.000000 0.000000
CodePudding user response:
Here is what I think you are trying to do. You have a timeseries consisting of 60 sampling intervals. For the purposes of this answer, I will assume the interval is 1 day, so you have 60 days in the timeseries. I also think you have 5 data columns for the timeseries. So your input is something like
date | var1 | var2 | var3 | var4 | var5 |
---|---|---|---|---|---|
2018-04-24 | 1.1 | 2.2 | 2.5 | 3.5 | 3.3 |
2018-04-25 | 1.0 | 2.3 | 3.9 | 8.7 | 2.7 |
2018-04-26 | 0.9 | 2.7 | 4.0 | 6.5 | 4.6 |
You are then calculating principal components for a sliding 2-day window. You want to combine all of these PCA results into a single data frame.
To combine all of the PCA results, you can use a MultiIndex with the date as the outer level and the component label as the inner level.
Here is a full working example.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
values_for_df = []
n_dates = 60
window_length = 2
n_columns = 5
n_components = 5

for i in range(n_columns):
    random_numbers = rng.random(size=n_dates)
    values_for_df.append(random_numbers)
df = pd.DataFrame(values_for_df).T

# one date per window; the last row cannot start a full 2-row window
dates = pd.date_range(start="2018-04-24", periods=n_dates-1)
# a 2-row window yields 2 components, hence one label per window row
pca_component_labels = [f"weights_{i + 1}" for i in range(window_length)]
my_index = pd.MultiIndex.from_product(
    (dates, pca_component_labels),
    names=["date", "pca_component"]
)

weights = []
for i in range(n_dates - 1):
    pca = PCA()
    transf = pca.fit_transform(df.iloc[i:i + window_length])
    weights.append(pca.components_)

# stack the (2, 5) arrays vertically; each column corresponds to one input variable
timeseries_pca = pd.DataFrame(
    np.concatenate(weights),
    index=my_index,
    columns=[f"PCA{i + 1}" for i in range(n_components)]
)
timeseries_pca
This is the result.
PCA1 PCA2 PCA3 PCA4 PCA5
date pca_component
2018-04-24 weights_1 -0.846738 -0.498592 0.166146 0.006707 -0.082409
weights_2 -0.530398 0.817748 -0.206418 -0.009032 0.085296
2018-04-25 weights_1 0.427636 0.095916 -0.576067 -0.381654 -0.574818
weights_2 0.875506 -0.012782 0.117490 0.117209 0.453634
2018-04-26 weights_1 0.291577 -0.361262 -0.599255 0.366988 -0.539153
... ... ... ... ... ...
2018-06-19 weights_2 0.468190 0.773792 -0.259101 0.301738 -0.154483
2018-06-20 weights_1 0.461097 0.308911 0.108246 0.824414 0.024239
weights_2 0.172187 0.429880 0.585669 -0.317087 -0.584809
2018-06-21 weights_1 -0.085999 0.664442 -0.486954 0.310776 -0.466278
weights_2 -0.128612 0.040888 0.719771 -0.016595 -0.680765
[118 rows x 5 columns]
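Once the weights sit in a frame with a (date, pca_component) MultiIndex, slicing it back apart is fairly direct. The following is a minimal sketch on a small stand-in frame with the same index layout as above; the random values are placeholders, not real PCA output.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with the same index layout as timeseries_pca.
dates = pd.date_range(start="2018-04-24", periods=3)
labels = ["weights_1", "weights_2"]
index = pd.MultiIndex.from_product(
    (dates, labels), names=["date", "pca_component"]
)
rng = np.random.default_rng(42)
timeseries_pca = pd.DataFrame(
    rng.normal(size=(6, 5)),
    index=index,
    columns=[f"PCA{i + 1}" for i in range(5)],
)

# Time series of the first component only: one row per date.
first_component = timeseries_pca.xs("weights_1", level="pca_component")

# All component rows for a single date.
one_day = timeseries_pca.loc[pd.Timestamp("2018-04-25")]

# Wide layout: one row per date, columns keyed by (variable, component).
wide = timeseries_pca.unstack("pca_component")

print(first_component)
print(one_day)
print(wide.shape)
```

The xs call is probably the most useful form for a backtest, since it gives a plain date-indexed frame for a single component.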