How to find the change in value between subsequent groups from an initial group in R?-CodePudding

I want to find the change of the height at each id for each event (B,C,D) from the initial event (A). For example, the change of A1 to B1 is 5 or A1 to C1 is 11.

df <- data.frame(event=c(rep('A',4),rep('B',4),rep('C',4),rep('D',4)), id=rep(1:4,4),height = c(1,4,2,4,6,8,9,6,12,15,13,16,21,24,26,24))
    event id height
1      A  1      1
2      A  2      4
3      A  3      2
4      A  4      4
5      B  1      6
6      B  2      8
7      B  3      9
8      B  4      6
9      C  1     12
10     C  2     15
11     C  3     13
12     C  4     16
13     D  1     21
14     D  2     24
15     D  3     26
16     D  4     24

To try to answer this I made a separate dataframe of just initial event points and then subtracted them from each event group.

df_a <- filter(df,event =="A") # filtered only initial event points

df %>%
  group_by(event) %>% # grouped events
  mutate(growth = height - df_a$height) # subtracted grouped event points by the initial event 


   event    id height growth
 1 A         1      1      0
 2 A         2      4      0
 3 A         3      2      0
 4 A         4      4      0
 5 B         1      6      5
 6 B         2      8      4
 7 B         3      9      7
 8 B         4      6      2
 9 C         1     12     11
10 C         2     15     11
11 C         3     13     11
12 C         4     16     12
13 D         1     21     20
14 D         2     24     20
15 D         3     26     24
16 D         4     24     20

But I am worried about duplicate, missing, or mistyped events or IDs that might throw off this vectored approach. I need suggestions on how to confirm that I will find the correct change for each respective ID. (A1 to B1, C1, or D1) and (A2 to B2, C2, D2) and so on.

Can I do this with looping? I am not very confident with looping but could looping accomplish this by confirming that only matching IDs can be used?

Or how can I confirm that every initial ID has a respective ID(or NA) at each event so the vectored approach will work 100% of the time.

Thank you.

CodePudding user response：

Do a grouping by 'id' and then subtract from subsetting the 'height' where 'event' is 'A'. Just to avoid for any missing values use in% instead of == (as %in% returns FALSE for NA whereas == returns NA) and also, after the subset, get the first element ([1]) in case there are duplicates

library(dplyr)
df %>%
    group_by(id) %>% 
    mutate(growth =  height - height[event %in% 'A'][1]) %>%
    ungroup

-output

# A tibble: 16 x 4
   event    id height growth
   <chr> <int>  <int>  <int>
 1 A         1      1      0
 2 A         2      4      0
 3 A         3      2      0
 4 A         4      4      0
 5 B         1      6      5
 6 B         2      8      4
 7 B         3      9      7
 8 B         4      6      2
 9 C         1     12     11
10 C         2     15     11
11 C         3     13     11
12 C         4     16     12
13 D         1     21     20
14 D         2     24     20
15 D         3     26     24
16 D         4     24     20

We can also test where there are no 'A's in a 'id'

> df %>% 
      slice(-1) %>%  # removed the first row
      group_by(id) %>% 
      mutate(growth =  height - height[event %in% 'A'][1]) %>%
      ungroup
# A tibble: 15 x 4
   event    id height growth
   <chr> <int>  <int>  <int>
 1 A         2      4      0
 2 A         3      2      0
 3 A         4      4      0
 4 B         1      6     NA # note the NAs for those cases
 5 B         2      8      4
 6 B         3      9      7
 7 B         4      6      2
 8 C         1     12     NA
 9 C         2     15     11
10 C         3     13     11
11 C         4     16     12
12 D         1     21     NA
13 D         2     24     20
14 D         3     26     24
15 D         4     24     20

data

df <- structure(list(event = c("A", "A", "A", "A", "B", "B", "B", "B", 
"C", "C", "C", "C", "D", "D", "D", "D"), id = c(1L, 2L, 3L, 4L, 
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), height = c(1L, 
4L, 2L, 4L, 6L, 8L, 9L, 6L, 12L, 15L, 13L, 16L, 21L, 24L, 26L, 
24L)), class = "data.frame", row.names = c("1", "2", "3", "4", 
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", 
"16"))