I am trying to add a new column, slope, to a dataframe in R. The data consists of groups of process codes, and the numeric column is a measure. Within each process code I need to calculate the slope from the earliest date through the most recent, then from the next earliest date through the (same) most recent date, and so on. This needs to be done for all groups. My current working code is below, but it is far too inefficient because it will be calculated on the fly within a Shiny app.
I researched vectorization, but I am not sure how to apply it to this particular loop. Any help is appreciated. I am open to using any data structure (data.frame, data.table, tibble, etc.).
Data
process_code <- c(1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100 )
date <- c("1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015", "1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015", "1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015", "1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015")
measure <- c(23.91, 6.37, 5.51, 2.78, 3.84, 1.78, 2.46, 18.34, 3.04, 2.18, 5.03, 15.20, 4.41, 9.31, 14.30, 3.97, 9.03, 17.26, 16.14, 1.72, 0.36, 4.93, 7.83, 24.91, 3.28, 3.24, 37.39, 12.78, 2.32, 0.42, 1.02, 1.06, 6.97, 1.58, 7.89, 0.98, 6.87, 1.64, 13.32, 4.79, 3.04, 0.06, 28.32, 61.10, 51.29, 1.74, 6.35, 15.64, 0.76, 1.80, 38.13, 20.81, 39.23, 0.54, 3.50, 14.88, 1.30, 3.64, 39.71, 19.42, 2.98, 24.49, 19.10, 43.34, 47.90, 0.06, 27.92, 9.41, 74.80, 3.30, 16.26, 8.75, 1.72, 0.72, 0.28, 2.48, 34.28, 23.91, 6.37, 5.51, 2.78, 3.84, 1.78, 2.46, 18.34, 3.04, 2.18, 5.03, 15.20, 4.41, 9.31, 14.30, 3.97, 9.03, 17.26, 16.14)
df <- data.frame(process_code, date, measure)
To be a bit more clear: I need the slope from a linear model of the measures within each group (1000, 2000, 3000, 3100), where the new slope field for each row is the lm coefficient fit from that row's date through the last date for that group.
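For instance, the slope on the second row of group 1000 is just the lm coefficient fit on that group's rows 2 through 24 after converting the dates; this snippet only illustrates the definition and is not my attempted solution:
g <- subset(df, process_code == 1000)
g$date <- lubridate::mdy(g$date)
# slope (per day) of measure vs. date over rows 2:24 of this group
coef(lm(measure ~ date, data = g[2:24, ]))[["date"]]
# ~0.0109, i.e. the slope shown for the 2/1/2014 row in the example output below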
Example output:
process_code date measure slope
1000 1/1/2014 23.91 0.004358691 (calculated on rows 1:24)
1000 2/1/2014 6.37 0.010929872 (calculated on rows 2:24)
1000 3/1/2014 5.51 0.011844316 (calculated on rows 3:24)
1000 4/1/2014 2.78 0.012493401 (calculated on rows 4:24)
1000 5/1/2014 3.84 0.011738559 (calculated on rows 5:24)
1000 6/1/2014 1.78 0.011125547 (calculated on rows 6:24)
1000 7/1/2014 2.46 0.008732907 (calculated on rows 7:24)
1000 8/1/2014 18.34 0.005684154 (calculated on rows 8:24)
1000 9/1/2014 3.04 0.014278285 (calculated on rows 9:24)
1000 10/1/2014 2.18 0.011992905 (calculated on rows 10:24)
1000 11/1/2014 5.03 0.007247907 (calculated on rows 11:24)
1000 12/1/2014 15.2 0.003286292 (calculated on rows 12:24)
1000 1/1/2015 4.41 0.012013625 (calculated on rows 13:24)
1000 2/1/2015 9.31 0.006491157 (calculated on rows 14:24)
1000 3/1/2015 14.3 0.007156341 (calculated on rows 15:24)
1000 4/1/2015 3.97 0.021458616 (calculated on rows 16:24)
1000 5/1/2015 9.03 0.011040557 (calculated on rows 17:24)
1000 6/1/2015 17.26 0.010767026 (calculated on rows 18:24)
1000 7/1/2015 16.14 0.061592437 (calculated on rows 19:24)
1000 8/1/2015 1.72 0.176137292 (calculated on rows 20:24)
1000 9/1/2015 0.36 0.251455313 (calculated on rows 21:24)
1000 10/1/2015 4.93 0.326241491 (calculated on rows 22:24)
1000 11/1/2015 7.83 0.569333333 (calculated on rows 23:24)
1000 12/1/2015 24.91 NA (calculated on rows 24:24)
2000 1/1/2014 3.28 0.022962316
2000 2/1/2014 3.24 0.022929315
2000 3/1/2014 37.39 0.022568032
2000 4/1/2014 12.78 0.037857024
2000 5/1/2014 2.32 0.044812372
2000 6/1/2014 0.42 0.047406643
2000 7/1/2014 1.02 0.048790496
2000 8/1/2014 1.06 0.050101068
2000 9/1/2014 6.97 0.050721253
2000 10/1/2014 1.58 0.055695014
2000 11/1/2014 7.89 0.055476477
2000 12/1/2014 0.98 0.060930758
2000 1/1/2015 6.87 0.056568831
2000 2/1/2015 1.64 0.056762249
2000 3/1/2015 13.32 0.042076169
2000 4/1/2015 4.79 0.043574206
2000 5/1/2015 3.04 0.01200964
2000 6/1/2015 0.06 -0.065807749
2000 7/1/2015 28.32 -0.257910924
2000 8/1/2015 61.1 -0.445256697
2000 9/1/2015 51.29 -0.335559403
2000 10/1/2015 1.74 0.227429237
2000 11/1/2015 6.35 0.309666667
2000 12/1/2015 15.64 NA
3000 1/1/2014 0.76 0.017380462
3000 2/1/2014 1.8 0.012550749
3000 3/1/2014 38.13 0.006578873
3000 4/1/2014 20.81 0.015682204
3000 5/1/2014 39.23 0.018549388
3000 6/1/2014 0.54 0.032761603
3000 7/1/2014 3.5 0.026633041
3000 8/1/2014 14.88 0.019617614
3000 9/1/2014 1.3 0.018499056
3000 10/1/2014 3.64 0.003585714
3000 11/1/2014 39.71 -0.016315323
3000 12/1/2014 19.42 -0.00068808
3000 1/1/2015 2.98 -0.006072538
3000 2/1/2015 24.49 -0.043833615
3000 3/1/2015 19.1 -0.059179313
3000 4/1/2015 43.34 -0.097751338
3000 5/1/2015 47.9 -0.077973318
3000 6/1/2015 0.06 -0.003093107
3000 7/1/2015 27.92 -0.135403423
3000 8/1/2015 9.41 -0.193780797
3000 9/1/2015 74.8 -0.606880545
3000 10/1/2015 3.3 0.091169832
3000 11/1/2015 16.26 -0.250333333
3000 12/1/2015 8.75 NA
3100 1/1/2014 1.72 0.006743937
3100 2/1/2014 0.72 0.005017926
3100 3/1/2014 0.28 0.002294689
3100 4/1/2014 2.48 -0.001544827
3100 5/1/2014 34.28 -0.005486151
3100 6/1/2014 23.91 0.007658664
3100 7/1/2014 6.37 0.01884284
3100 8/1/2014 5.51 0.021336146
3100 9/1/2014 2.78 0.023634101
3100 10/1/2014 3.84 0.02372403
3100 11/1/2014 1.78 0.02423667
3100 12/1/2014 2.46 0.021476715
3100 1/1/2015 18.34 0.017138199
3100 2/1/2015 3.04 0.037319137
3100 3/1/2015 2.18 0.036422251
3100 4/1/2015 5.03 0.029632893
3100 5/1/2015 15.2 0.023099668
3100 6/1/2015 4.41 0.053426425
3100 7/1/2015 9.31 0.044761445
3100 8/1/2015 14.3 0.055473621
3100 9/1/2015 3.97 0.14743562
3100 10/1/2015 9.03 0.11738445
3100 11/1/2015 17.26 -0.037333333
3100 12/1/2015 16.14 NA
My solution so far, which does work, uses the last row per group as the "stopping point" for each group's lm calculation. I find the last row index for each group and attach it to every row of that process_code group by joining a "last row only" dataframe back to the original dataframe. Then, inside the for loop, I use that row index to subset the data passed to the model.
library(plyr)
library(dplyr)

df$date <- lubridate::mdy(df$date)
df$slope <- 0
df$row <- row.names(df)

# last row index of each process_code group
last_row_by_group <- aggregate(df[, c('row')], list(df$process_code), tail, 1)
names(last_row_by_group)[1] <- "process_code"
names(last_row_by_group)[2] <- "last_row"

# attach each group's last row index to every row of that group
df <- join(df, last_row_by_group, type = 'left')
df$last_row <- as.numeric(df$last_row)

# fit an lm from each row through its group's last row and keep the slope
for (i in 1:nrow(df)) {
  lm_return <- lm(measure ~ date, df[i:df$last_row[i], ])
  df[i, "slope"] <- lm_return$coefficients[2]
}
So this works and produces the example output above. In the real data there are hundreds of groups and many years' worth of data. How can I speed this up?
Answer:
Define a slope function, convert the date to Date class, and then, grouping by process_code, run rollapply over the measure and date using slope with the widths n, n-1, ..., 2, 1 and left alignment, so each window starts at the current value and runs forward to the end of the group. coredata = FALSE ensures that a zoo object, rather than a plain vector, is passed to slope, so time(x) supplies the dates. The coredata() at the end of the mutate pipeline converts the zoo result back to a plain vector. Be sure plyr is NOT loaded, to avoid name conflicts.
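To see what the decreasing widths do, here is a small illustration (this toy call is only for exposition and is not part of the solution): with align = "left", the width at each position counts forward from that position, so widths 5:1 make every window run from the current element to the end.
library(zoo)
# widths 5:1 with left alignment: windows 1:5, 2:5, 3:5, 4:5, 5
rollapply(zoo(1:5), 5:1, FUN = sum, align = "left", fill = NA)
# values: 15 14 12 9 5
With that in mind, the full pipeline is: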
library(dplyr, exclude = c("filter", "lag"))
library(lubridate)
library(zoo)
slope <- function(x, tt = time(x)) cov(x, tt) / var(tt)
df %>%
  mutate(date = mdy(date)) %>%
  group_by(process_code) %>%
  mutate(slope =
    zoo(measure, as.numeric(date)) %>%
      rollapply(n():1, slope, align = "left", fill = NA, coredata = FALSE) %>%
      coredata()
  ) %>%
  ungroup
giving:
# A tibble: 96 x 4
process_code date measure slope
<dbl> <date> <dbl> <dbl>
1 1000 2014-01-01 23.9 0.00436
2 1000 2014-02-01 6.37 0.0109
3 1000 2014-03-01 5.51 0.0118
4 1000 2014-04-01 2.78 0.0125
5 1000 2014-05-01 3.84 0.0117
6 1000 2014-06-01 1.78 0.0111
7 1000 2014-07-01 2.46 0.00873
8 1000 2014-08-01 18.3 0.00568
9 1000 2014-09-01 3.04 0.0143
10 1000 2014-10-01 2.18 0.0120
# ... with 86 more rows
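As a quick, purely illustrative check that the closed-form slope used here, cov(x, t) / var(t), matches the coefficient an lm would give (using df exactly as defined in the Data section, i.e. with character dates):
g <- subset(df, process_code == 1000)
g$date <- mdy(g$date)                           # character -> Date
slope(g$measure, as.numeric(g$date))            # closed-form slope
unname(coef(lm(measure ~ date, data = g))[2])   # lm slope, both ~0.00436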
Note
The question's own code modifies df in place (it converts date and adds slope, row and last_row columns), so to be clear, the df used above is the original one exactly as defined in the Data section.