I have the following data structure, with Stocks S, having features f:
year S1_f1 S1_f2 S2_f1 S2_f2 S3_f1 S3_f2 Sn_f1 Sn_f2
2011 0.1 0.4 0.12 0.42 0.2 0.5 n n
2012 0.4 0.7 0.42 0.72 0.5 0.8 n n
2013 0.7 0.9 0.72 0.5 0.8 0.9 n n
n n n n n n n n n
My original df has 10 observations but 50k predictors, so I want to shift the balance toward the observation side.
Hence, I want to have the following dataframe:
year S1_f1 S1_f2 S2_f1 S2_f2 S3_f1 S3_f2 Sn_f1 Sn_f2
2011 0.1 0.4 0 0 0 0 0 0
2012 0.4 0.7 0 0 0 0 0 0
2013 0.7 0.9 0 0 0 0 0 0
2011 0 0 0.12 0.42 0 0 0 0
2012 0 0 0.42 0.72 0 0 0 0
2013 0 0 0.72 0.5 0 0 0 0
2011 0 0 0 0 0.2 0.5 0 0
2012 0 0 0 0 0.5 0.8 0 0
2013 0 0 0 0 0.8 0.9 0 0
n 0 0 0 0 0 0 n n
...and so on (example values).
I want to artificially multiply my timestamps via this approach.
Is there an elegant way to do this?
CodePudding user response:
You can convert what you have into what you want using the following code:
library(data.table)
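# s is the data.frame shown under "Input" below
dcast(
  # long format, plus a grp column holding the stock prefix (S1, S2, ...)
  melt(setDT(s), id.vars = "year")[, grp := gsub("_.*$", "", variable)],
  year + grp ~ variable,
  value.var = "value"
)[order(grp, year)]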
Output:
year grp S1_f1 S1_f2 S2_f1 S2_f2 S3_f1 S3_f2
<int> <char> <num> <num> <num> <num> <num> <num>
1: 2011 S1 0.1 0.4 NA NA NA NA
2: 2012 S1 0.4 0.7 NA NA NA NA
3: 2013 S1 0.7 0.9 NA NA NA NA
4: 2011 S2 NA NA 0.12 0.42 NA NA
5: 2012 S2 NA NA 0.42 0.72 NA NA
6: 2013 S2 NA NA 0.72 0.50 NA NA
7: 2011 S3 NA NA NA NA 0.2 0.5
8: 2012 S3 NA NA NA NA 0.5 0.8
9: 2013 S3 NA NA NA NA 0.8 0.9
Input:
s <- structure(list(year = 2011:2013, S1_f1 = c(0.1, 0.4, 0.7), S1_f2 = c(0.4,
0.7, 0.9), S2_f1 = c(0.12, 0.42, 0.72), S2_f2 = c(0.42, 0.72,
0.5), S3_f1 = c(0.2, 0.5, 0.8), S3_f2 = c(0.5, 0.8, 0.9)), row.names = c(NA,
-3L), class = "data.frame")
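The output above has NA where the question asks for 0, and it carries an extra grp column. A minimal sketch of one way to close that gap, assuming dcast's fill argument is used for the missing cells and the helper names long and res are only illustrative:

```r
library(data.table)

# Same sample data as in the "Input" above
s <- data.frame(
  year  = 2011:2013,
  S1_f1 = c(0.1, 0.4, 0.7),    S1_f2 = c(0.4, 0.7, 0.9),
  S2_f1 = c(0.12, 0.42, 0.72), S2_f2 = c(0.42, 0.72, 0.5),
  S3_f1 = c(0.2, 0.5, 0.8),    S3_f2 = c(0.5, 0.8, 0.9)
)

# Long format with one row per (year, column), tagged with the stock prefix
long <- melt(as.data.table(s), id.vars = "year")
long[, grp := sub("_.*$", "", variable)]

# fill = 0 puts 0 instead of NA into the cells that have no observation
res <- dcast(long, year + grp ~ variable, value.var = "value", fill = 0)
res <- res[order(grp, year)]

# Drop the helper column so the layout matches the one requested in the question
res[, grp := NULL]
res
```

as.data.table is used here instead of setDT so that s itself is left untouched.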
CodePudding user response:
One possible way to solve your problem (note that I did not convert the data, say df, into a data.table):
library(data.table)
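# extract the stock ids ("1", "2", ...) from the column names
result = sub("^S(\\d+)_.*", "\\1", names(df)[-1]) |>
  unique() |>
  # one data.frame per stock: year plus that stock's two features
  lapply(function(i) df[c("year", sprintf(c("S%s_f1", "S%s_f2"), i))]) |>
  # stack the blocks; columns missing from a block become NA, then 0
  rbindlist(use.names=TRUE, fill=TRUE) |>
  setnafill(fill=0)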
year S1_f1 S1_f2 S2_f1 S2_f2 S3_f1 S3_f2
<int> <num> <num> <num> <num> <num> <num>
1: 2011 0.1 0.4 0.00 0.00 0.0 0.0
2: 2012 0.4 0.7 0.00 0.00 0.0 0.0
3: 2013 0.7 0.9 0.00 0.00 0.0 0.0
4: 2011 0.0 0.0 0.12 0.42 0.0 0.0
5: 2012 0.0 0.0 0.42 0.72 0.0 0.0
6: 2013 0.0 0.0 0.72 0.50 0.0 0.0
7: 2011 0.0 0.0 0.00 0.00 0.2 0.5
8: 2012 0.0 0.0 0.00 0.00 0.5 0.8
9: 2013 0.0 0.0 0.00 0.00 0.8 0.9
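The code above assumes exactly two features per stock, named f1 and f2. If the number of features varies, the same idea can be sketched in base R by selecting columns on the stock prefix. This is only an illustration on the sample data; stocks and blocks are hypothetical helper names:

```r
# Sample data from the question
df <- data.frame(
  year  = 2011:2013,
  S1_f1 = c(0.1, 0.4, 0.7),    S1_f2 = c(0.4, 0.7, 0.9),
  S2_f1 = c(0.12, 0.42, 0.72), S2_f2 = c(0.42, 0.72, 0.5),
  S3_f1 = c(0.2, 0.5, 0.8),    S3_f2 = c(0.5, 0.8, 0.9)
)

# Stock prefixes in order of first appearance: "S1", "S2", "S3"
stocks <- unique(sub("_.*", "", names(df)[-1]))

# For each stock, copy df and zero out every column of the other stocks.
# Matching on "S1_" rather than "S1" keeps S1 from also matching S10, S11, ...
blocks <- lapply(stocks, function(s) {
  out  <- df
  keep <- startsWith(names(df), paste0(s, "_"))
  out[, !keep & names(df) != "year"] <- 0
  out
})

# Stack the per-stock copies on top of each other
result <- do.call(rbind, blocks)
result
```

This copies the full data frame once per stock, so it is only meant for small inputs.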
CodePudding user response:
Using the sample data frame DF, defined reproducibly in the Note at the end, create a vector g that groups the columns; for the example it equals c("S1", "S1", "S2", "S2", "S3", "S3"). Then use it to split the columns into a list of matrices L, one matrix for each level of g. Apply .bdiag from the Matrix package to that list to create a block diagonal matrix, then insert the year column and set the column names. Note that Matrix is a recommended package that ships with standard R distributions, so it normally does not have to be installed separately.
library(Matrix)
g <- sub("_.*", "", names(DF)[-1])
L <- tapply(as.list(DF[-1]), g, function(x) as.matrix(as.data.frame(x)))
setNames(data.frame(DF$year, as.matrix(.bdiag(L))), names(DF))
giving:
year S1_f1 S1_f2 S2_f1 S2_f2 S3_f1 S3_f2
1 2011 0.1 0.4 0.00 0.00 0.0 0.0
2 2012 0.4 0.7 0.00 0.00 0.0 0.0
3 2013 0.7 0.9 0.00 0.00 0.0 0.0
4 2011 0.0 0.0 0.12 0.42 0.0 0.0
5 2012 0.0 0.0 0.42 0.72 0.0 0.0
6 2013 0.0 0.0 0.72 0.50 0.0 0.0
7 2011 0.0 0.0 0.00 0.00 0.2 0.5
8 2012 0.0 0.0 0.00 0.00 0.5 0.8
9 2013 0.0 0.0 0.00 0.00 0.8 0.9
Note
Lines <- "
year S1_f1 S1_f2 S2_f1 S2_f2 S3_f1 S3_f2
2011 0.1 0.4 0.12 0.42 0.2 0.5
2012 0.4 0.7 0.42 0.72 0.5 0.8
2013 0.7 0.9 0.72 0.5 0.8 0.9"
DF <- read.table(text = Lines, header = TRUE)
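With the 50k predictors mentioned in the question, a dense block diagonal result would consist almost entirely of zeros. A rough sketch of keeping it sparse instead, assuming the exported bdiag function from Matrix and keeping the year values in a separate vector (B and year are illustrative names):

```r
library(Matrix)

DF <- read.table(text = "
year S1_f1 S1_f2 S2_f1 S2_f2 S3_f1 S3_f2
2011 0.1 0.4 0.12 0.42 0.2 0.5
2012 0.4 0.7 0.42 0.72 0.5 0.8
2013 0.7 0.9 0.72 0.5 0.8 0.9", header = TRUE)

# Group columns by stock prefix; explicit levels keep the original stock
# order (plain alphabetical sorting would put S10 before S2)
g <- sub("_.*", "", names(DF)[-1])
g <- factor(g, levels = unique(g))

# One matrix per stock, then a sparse block diagonal matrix of all of them
L <- lapply(split.default(DF[-1], g), as.matrix)
B <- bdiag(unname(L))
colnames(B) <- unlist(lapply(L, colnames), use.names = FALSE)

# The year column repeats once per block and is kept alongside B
year <- rep(DF$year, times = length(L))

B
```

Whether a sparse matrix can be passed on directly depends on the modelling function used afterwards, so treat this as a starting point only.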