Home > database >  R: summarise with linear regression
R: summarise with linear regression

Time:01-29

I have a dataset with a yearly average over several items.

data have:

df <- data.frame(item  = c(1, 1, 1, 1,  2, 2, 2, 2,  3, 3, 3, 3,  4, 4, 4, 4),
                 year  = c(1, 2, 3, 4,  1, 2, 3, 4,  1, 2, 3, 4,  1, 2, 3, 4)
                 value = c(1, 2, 3, 3,  2, 3, 4, 0,  1, 1, 3, 2,  2, 1, 1, 2))

I need to produce a simple linear regression model, between the year and average, separately for each item.

data want:

| year |  slope |  intercept |
|:----:|:------:|:----------:|
|    1 |   0.7  |        0.5 |
|    2 |  -0.2  |        3.0 |    
|    3 |   0.5  |        0.5 |  
|    4 |   0.0  |        1.5 |

my attempt:

I believe it could be done with some combination of group_by(), summarise() amd lm().

so I tried doing this:

library(dplyr)
df %>%
  group_by(item) %>%
  summarise(slope     = unname(lm(value ~ year, data=df)[[1]])[2],
            intercept = unname(lm(value ~ year, data=df)[[1]])[1])

but it returns the same slope and intercept for all lines. In other words, group_by was not applied.

Thus, my questions are:

  • How can the grouping be achieved?
  • How can the calculation be applied to each item separately?

CodePudding user response:

With lm( ..., data=df) you still use the same df object you started with, to make your own approach work you would need to replace it with something that's aware of the context (i.e. grouping) and returns the current data for the current group: cur_data()

library(dplyr)
df <- data.frame(item  = c(1, 1, 1, 1,  2, 2, 2, 2,  3, 3, 3, 3,  4, 4, 4, 4),
                 year  = c(1, 2, 3, 4,  1, 2, 3, 4,  1, 2, 3, 4,  1, 2, 3, 4),
                 value = c(1, 2, 3, 3,  2, 3, 4, 0,  1, 1, 3, 2,  2, 1, 1, 2))

df %>%
  group_by(item) %>%
  summarise(slope     = unname(lm(value ~ year, data=cur_data())[[1]])[2],
            intercept = unname(lm(value ~ year, data=cur_data())[[1]])[1])
#> # A tibble: 4 × 3
#>    item slope intercept
#>   <dbl> <dbl>     <dbl>
#> 1     1   0.7       0.5
#> 2     2  -0.5       3.5
#> 3     3   0.5       0.5
#> 4     4   0         1.5

Created on 2023-01-29 with reprex v2.0.2

CodePudding user response:

You may try by.

data.frame(unique(df$item), t(sapply(by(df, df$item, lm, fo=value ~ year), coef))) |>
  `colnames<-`(c('year', 'intercept', 'slope'))
#   year intercept slope
# 1    1       0.5   0.7
# 2    2       3.5  -0.5
# 3    3       0.5   0.5
# 4    4       1.5   0.0

Perhaps easier to understand, but yields the same:

by(df, df$item, lm, fo = value ~ year) |>
  sapply(coef) |>
  t() |>
  data.frame(unique(df$item)) |>
  `colnames<-`(c('intercept', 'slope', 'year'))

CodePudding user response:

library(tidyverse)

df %>% group_by(item) %>% 
do(reg = lm(value ~ year, data = .)$coefficients) %>%
unnest_wider(reg)

gives,

#   item `(Intercept)`   year
#  <dbl>         <dbl>  <dbl>
#1     1         0.500  0.7  
#2     2         3.50  -0.500
#3     3         0.5    0.5  
#4     4         1.5    0  

CodePudding user response:

Another approach using tidy and converting the results to wider format using pivot_wider like this (you could add more results like p-value):

library(dplyr)
library(tidyr)
library(broom)
df %>%
  group_by(item) %>%
  do(tidy(lm(value ~ year, data = .))) %>%
  select(item, term, estimate) %>%
  pivot_wider(names_from = term, values_from = estimate) %>%
  rename(slope = year, intercept = `(Intercept)`)
#> # A tibble: 4 × 3
#> # Groups:   item [4]
#>    item intercept slope
#>   <dbl>     <dbl> <dbl>
#> 1     1       0.5   0.7
#> 2     2       3.5  -0.5
#> 3     3       0.5   0.5
#> 4     4       1.5   0

Created on 2023-01-29 with reprex v2.0.2

CodePudding user response:

Here is one more solution (quite verbose) using group_split to make lists, then use map_dfr to iterate over each list. Get the tidy output with brooms tidy() and finally bring in shape:

library(broom)
library(tidyverse)
df %>% 
  mutate(item = factor(item)) %>% 
  group_split(item) %>% 
  map_dfr(.f = function(df){
    lm(value ~ year, data = df) %>% 
      tidy() %>% 
      add_column(item = unique(df$item), .before = 1)
  }) %>% 
  mutate(term = str_replace_all(term, "[^[:alnum:]]", "")) %>% 
  select(item, term, estimate) %>% 
  pivot_wider(names_from = term,
              values_from = estimate)
  item  Intercept  year
  <fct>     <dbl> <dbl>
1 1           0.5   0.7
2 2           3.5  -0.5
3 3           0.5   0.5
4 4           1.5   0  
  • Related