I am trying to add a new column, slope, to a dataframe in R. The data consists of groups of process codes, and the numeric column is a measure. Within each process code I need to calculate the slope from the earliest date through the most recent, then from the next earliest date through the (same) most recent date, and so on. This needs to be done for all groups. My current working code is below, but it is far too inefficient because it will be calculated on the fly within a Shiny app.
I researched vectorization, but I am not sure how to apply it to this particular loop. Any help is appreciated. I am open to using any data structure (data.frame, data.table, tibble, etc.).
Data
process_code <- c(1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100 )
date <- c("1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015", "1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015", "1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015", "1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015")
measure <- c(23.91, 6.37, 5.51, 2.78, 3.84, 1.78, 2.46, 18.34, 3.04, 2.18, 5.03, 15.20, 4.41, 9.31, 14.30, 3.97, 9.03, 17.26, 16.14, 1.72, 0.36, 4.93, 7.83, 24.91, 3.28, 3.24, 37.39, 12.78, 2.32, 0.42, 1.02, 1.06, 6.97, 1.58, 7.89, 0.98, 6.87, 1.64, 13.32, 4.79, 3.04, 0.06, 28.32, 61.10, 51.29, 1.74, 6.35, 15.64, 0.76, 1.80, 38.13, 20.81, 39.23, 0.54, 3.50, 14.88, 1.30, 3.64, 39.71, 19.42, 2.98, 24.49, 19.10, 43.34, 47.90, 0.06, 27.92, 9.41, 74.80, 3.30, 16.26, 8.75, 1.72, 0.72, 0.28, 2.48, 34.28, 23.91, 6.37, 5.51, 2.78, 3.84, 1.78, 2.46, 18.34, 3.04, 2.18, 5.03, 15.20, 4.41, 9.31, 14.30, 3.97, 9.03, 17.26, 16.14)
df <- data.frame(process_code, date, measure)
To be a bit more clear: I need the slope from a linear model of the measures within each group (1000, 2000, 3000, 3100), where the new slope field for each row is the lm coefficient fit from that row's date through the last date for that group.
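For instance, the slope on the second row of group 1000 is just the lm coefficient fit on that group's rows 2 through 24 after converting the dates; this snippet only illustrates the definition and is not my attempted solution:
g <- subset(df, process_code == 1000)
g$date <- lubridate::mdy(g$date)
# slope (per day) of measure vs. date over rows 2:24 of this group
coef(lm(measure ~ date, data = g[2:24, ]))[["date"]]
# ~0.0109, i.e. the slope shown for the 2/1/2014 row in the example output below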
Example output:
process_code date measure slope
1000 1/1/2014 23.91 0.004358691 (calculated on rows 1:24)
1000 2/1/2014 6.37 0.010929872 (calculated on rows 2:24)
1000 3/1/2014 5.51 0.011844316 (calculated on rows 3:24)
1000 4/1/2014 2.78 0.012493401 (calculated on rows 4:24)
1000 5/1/2014 3.84 0.011738559 (calculated on rows 5:24)
1000 6/1/2014 1.78 0.011125547 (calculated on rows 6:24)
1000 7/1/2014 2.46 0.008732907 (calculated on rows 7:24)
1000 8/1/2014 18.34 0.005684154 (calculated on rows 8:24)
1000 9/1/2014 3.04 0.014278285 (calculated on rows 9:24)
1000 10/1/2014 2.18 0.011992905 (calculated on rows 10:24)
1000 11/1/2014 5.03 0.007247907 (calculated on rows 11:24)
1000 12/1/2014 15.2 0.003286292 (calculated on rows 12:24)
1000 1/1/2015 4.41 0.012013625 (calculated on rows 13:24)
1000 2/1/2015 9.31 0.006491157 (calculated on rows 14:24)
1000 3/1/2015 14.3 0.007156341 (calculated on rows 15:24)
1000 4/1/2015 3.97 0.021458616 (calculated on rows 16:24)
1000 5/1/2015 9.03 0.011040557 (calculated on rows 17:24)
1000 6/1/2015 17.26 0.010767026 (calculated on rows 18:24)
1000 7/1/2015 16.14 0.061592437 (calculated on rows 19:24)
1000 8/1/2015 1.72 0.176137292 (calculated on rows 20:24)
1000 9/1/2015 0.36 0.251455313 (calculated on rows 21:24)
1000 10/1/2015 4.93 0.326241491 (calculated on rows 22:24)
1000 11/1/2015 7.83 0.569333333 (calculated on rows 23:24)
1000 12/1/2015 24.91 NA (calculated on rows 24:24)
2000 1/1/2014 3.28 0.022962316
2000 2/1/2014 3.24 0.022929315
2000 3/1/2014 37.39 0.022568032
2000 4/1/2014 12.78 0.037857024
2000 5/1/2014 2.32 0.044812372
2000 6/1/2014 0.42 0.047406643
2000 7/1/2014 1.02 0.048790496
2000 8/1/2014 1.06 0.050101068
2000 9/1/2014 6.97 0.050721253
2000 10/1/2014 1.58 0.055695014
2000 11/1/2014 7.89 0.055476477
2000 12/1/2014 0.98 0.060930758
2000 1/1/2015 6.87 0.056568831
2000 2/1/2015 1.64 0.056762249
2000 3/1/2015 13.32 0.042076169
2000 4/1/2015 4.79 0.043574206
2000 5/1/2015 3.04 0.01200964
2000 6/1/2015 0.06 -0.065807749
2000 7/1/2015 28.32 -0.257910924
2000 8/1/2015 61.1 -0.445256697
2000 9/1/2015 51.29 -0.335559403
2000 10/1/2015 1.74 0.227429237
2000 11/1/2015 6.35 0.309666667
2000 12/1/2015 15.64 NA
3000 1/1/2014 0.76 0.017380462
3000 2/1/2014 1.8 0.012550749
3000 3/1/2014 38.13 0.006578873
3000 4/1/2014 20.81 0.015682204
3000 5/1/2014 39.23 0.018549388
3000 6/1/2014 0.54 0.032761603
3000 7/1/2014 3.5 0.026633041
3000 8/1/2014 14.88 0.019617614
3000 9/1/2014 1.3 0.018499056
3000 10/1/2014 3.64 0.003585714
3000 11/1/2014 39.71 -0.016315323
3000 12/1/2014 19.42 -0.00068808
3000 1/1/2015 2.98 -0.006072538
3000 2/1/2015 24.49 -0.043833615
3000 3/1/2015 19.1 -0.059179313
3000 4/1/2015 43.34 -0.097751338
3000 5/1/2015 47.9 -0.077973318
3000 6/1/2015 0.06 -0.003093107
3000 7/1/2015 27.92 -0.135403423
3000 8/1/2015 9.41 -0.193780797
3000 9/1/2015 74.8 -0.606880545
3000 10/1/2015 3.3 0.091169832
3000 11/1/2015 16.26 -0.250333333
3000 12/1/2015 8.75 NA
3100 1/1/2014 1.72 0.006743937
3100 2/1/2014 0.72 0.005017926
3100 3/1/2014 0.28 0.002294689
3100 4/1/2014 2.48 -0.001544827
3100 5/1/2014 34.28 -0.005486151
3100 6/1/2014 23.91 0.007658664
3100 7/1/2014 6.37 0.01884284
3100 8/1/2014 5.51 0.021336146
3100 9/1/2014 2.78 0.023634101
3100 10/1/2014 3.84 0.02372403
3100 11/1/2014 1.78 0.02423667
3100 12/1/2014 2.46 0.021476715
3100 1/1/2015 18.34 0.017138199
3100 2/1/2015 3.04 0.037319137
3100 3/1/2015 2.18 0.036422251
3100 4/1/2015 5.03 0.029632893
3100 5/1/2015 15.2 0.023099668
3100 6/1/2015 4.41 0.053426425
3100 7/1/2015 9.31 0.044761445
3100 8/1/2015 14.3 0.055473621
3100 9/1/2015 3.97 0.14743562
3100 10/1/2015 9.03 0.11738445
3100 11/1/2015 17.26 -0.037333333
3100 12/1/2015 16.14 NA
My solution so far, which does work, uses the last row per group as the "stopping point" for each group's lm calculation. I find the last row index for each group and attach it to every row of that process_code group by joining a "last row only" dataframe back to the original dataframe. Then, inside the for loop, I use that row index to subset the data passed to the model.
library(plyr)
library(dplyr)

df$date <- lubridate::mdy(df$date)
df$slope <- 0
df$row <- row.names(df)

# last row index of each process_code group
last_row_by_group <- aggregate(df[, c('row')], list(df$process_code), tail, 1)
names(last_row_by_group)[1] <- "process_code"
names(last_row_by_group)[2] <- "last_row"

# attach each group's last row index to every row of that group
df <- join(df, last_row_by_group, type = 'left')
df$last_row <- as.numeric(df$last_row)

# fit an lm from each row through its group's last row and keep the slope
for (i in 1:nrow(df)) {
  lm_return <- lm(measure ~ date, df[i:df$last_row[i], ])
  df[i, "slope"] <- lm_return$coefficients[2]
}
So this works and produces the example output above. In the real data there are hundreds of groups and many years' worth of data. How can I speed this up?
Answer:
Define a slope function, convert the date to Date class, and then, grouping by process_code, run rollapply over the measure and date using slope with the widths n, n-1, ..., 2, 1 and left alignment, so each window starts at the current value and runs forward to the end of the group. coredata = FALSE ensures that a zoo object, rather than a plain vector, is passed to slope, so time(x) supplies the dates. The coredata() at the end of the mutate pipeline converts the zoo result back to a plain vector. Be sure plyr is NOT loaded, to avoid name conflicts.
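To see what the decreasing widths do, here is a small illustration (this toy call is only for exposition and is not part of the solution): with align = "left", the width at each position counts forward from that position, so widths 5:1 make every window run from the current element to the end.
library(zoo)
# widths 5:1 with left alignment: windows 1:5, 2:5, 3:5, 4:5, 5
rollapply(zoo(1:5), 5:1, FUN = sum, align = "left", fill = NA)
# values: 15 14 12 9 5
With that in mind, the full pipeline is: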
library(dplyr, exclude = c("filter", "lag"))
library(lubridate)
library(zoo)
slope <- function(x, tt = time(x)) cov(x, tt) / var(tt)
df %>%
  mutate(date = mdy(date)) %>%
  group_by(process_code) %>%
  mutate(slope =
    zoo(measure, as.numeric(date)) %>%
      rollapply(n():1, slope, align = "left", fill = NA, coredata = FALSE) %>%
      coredata()
  ) %>%
  ungroup
giving:
# A tibble: 96 x 4
process_code date measure slope
<dbl> <date> <dbl> <dbl>
1 1000 2014-01-01 23.9 0.00436
2 1000 2014-02-01 6.37 0.0109
3 1000 2014-03-01 5.51 0.0118
4 1000 2014-04-01 2.78 0.0125
5 1000 2014-05-01 3.84 0.0117
6 1000 2014-06-01 1.78 0.0111
7 1000 2014-07-01 2.46 0.00873
8 1000 2014-08-01 18.3 0.00568
9 1000 2014-09-01 3.04 0.0143
10 1000 2014-10-01 2.18 0.0120
# ... with 86 more rows
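As a quick, purely illustrative check that the closed-form slope used here, cov(x, t) / var(t), matches the coefficient an lm would give (using df exactly as defined in the Data section, i.e. with character dates):
g <- subset(df, process_code == 1000)
g$date <- mdy(g$date)                           # character -> Date
slope(g$measure, as.numeric(g$date))            # closed-form slope
unname(coef(lm(measure ~ date, data = g))[2])   # lm slope, both ~0.00436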
Note
The question's own code modifies df in place (it converts date and adds slope, row and last_row columns), so to be clear, the df used above is the original one exactly as defined in the Data section.