I want to calculate the percentage difference based on groups in b
(A,A;B,B;C,C) of a data.frame
. The new variable rate
should contain the change in c
based on the time stamps a
(i.e. 3 and 7, while the smaller number is always the previous). It should be NA
, when there is just one time stamp for a given group.
I tried to use the common methods, but i always get a vector of zeros. Example data:
a <- c(3,7,3,7,3,7,3,3,7,3,7,3,7,7)
b <- c("a", "a", "b", "b", "c", "c", "d", "e", "e", "f","f", "g", "g", "h")
c <- runif(14, min=80, max=100)
df <- data.frame(a,b,c)
I tried the common approach with using group_by
, mutate
and lag
i.e.
df %>% group_by(a,b) %>%
mutate(rate = 100 * (c - lag(c))/lag(c))
I also tried arrange
before using lag
, but the result i get is always 0
. I think its because there is not a time series which is why we cant use lag
. There must be an easy way to achieve this with using regular functions.
CodePudding user response:
If you use lag()
to define your rate, you will end up with "NA" for every b == 3, and the correct value for b == 7 when you group_by "b". Also, c()
is a Primitive function in R and it's best not to use "c" as the name of a variable.
Is this the outcome you're after?
library(tidyverse)
a <- c(3,7,3,7,3,7,3,3,7,3,7,3,7,7)
b <- c("a", "a", "b", "b", "c", "c", "d", "e", "e", "f","f", "g", "g", "h")
d <- c(80, 100, runif(12, min=80, max=100))
df <- data.frame(a,b,d)
df %>% group_by(b) %>%
mutate(rate = 100 * (d - lag(d, default = NA))/lag(d, default = NA))
#> # A tibble: 14 × 4
#> # Groups: b [8]
#> a b d rate
#> <dbl> <chr> <dbl> <dbl>
#> 1 3 a 80 NA
#> 2 7 a 100 25
#> 3 3 b 88.0 NA
#> 4 7 b 91.1 3.54
#> 5 3 c 95.1 NA
#> 6 7 c 82.7 -13.1
#> 7 3 d 92.6 NA
#> 8 3 e 84.1 NA
#> 9 7 e 91.8 9.20
#> 10 3 f 81.9 NA
#> 11 7 f 93.6 14.4
#> 12 3 g 88.7 NA
#> 13 7 g 80.6 -9.11
#> 14 7 h 99.2 NA
Created on 2021-12-20 by the reprex package (v2.0.1)
You can be more flexible with ifelse()
's too, e.g. if you want NA's for cases where you have a single group but zeros for cases where a == 3:
library(tidyverse)
a <- c(3,7,3,7,3,7,3,3,7,3,7,3,7,7)
b <- c("a", "a", "b", "b", "c", "c", "d", "e", "e", "f","f", "g", "g", "h")
d <- c(80, 100, runif(12, min=80, max=100))
df <- data.frame(a,b,d)
df %>% group_by(b) %>%
mutate(group_number = n()) %>%
mutate(rate = ifelse(group_number == 1, NA, ifelse(a == 7, 100 * (d - lag(d, default = NA))/lag(d, default = NA), 0))) %>%
select(-group_number) %>%
ungroup()
#> # A tibble: 14 × 4
#> a b d rate
#> <dbl> <chr> <dbl> <dbl>
#> 1 3 a 80 0
#> 2 7 a 100 25
#> 3 3 b 95.8 0
#> 4 7 b 83.9 -12.5
#> 5 3 c 87.0 0
#> 6 7 c 81.5 -6.26
#> 7 3 d 97.0 NA
#> 8 3 e 99.1 0
#> 9 7 e 82.6 -16.6
#> 10 3 f 82.3 0
#> 11 7 f 96.0 16.7
#> 12 3 g 99.5 0
#> 13 7 g 93.4 -6.09
#> 14 7 h 86.8 NA
Created on 2021-12-20 by the reprex package (v2.0.1)