I'm struggling on how can I calculate the difference between the first and last value, arranged by date, by groups. Here is a toy example:
test1 = data.frame(my_groups = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "A", "A", "A"),
measure = c(10, 20, 5, 64, 2, 62 ,2, 5, 4, 6, 7, 105),
#distance = c(),
time= as.Date(c("20-09-2020", "25-09-2020", "19-09-2020", "20-05-2020", "20-05-2020", "20-06-2021",
"11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021", "19-01-2021"), format = "%d-%m-%Y"))
# test1 %>% arrange(my_groups, time)
# my_groups measure time
# 1 A 5 2020-09-19
# 2 A 10 2020-09-20
# 3 A 20 2020-09-25
# 4 A 6 2021-01-15
# 5 A 7 2021-01-15
# 6 A 105 2021-01-19
# 7 B 64 2020-05-20
# 8 B 2 2020-05-20
# 9 B 62 2021-06-20
# 10 C 2 2021-01-11
# 11 C 5 2021-01-13
# 12 C 1 2021-01-13
#desired result
# my_groups diff
# 1 A 100 (105 - 5)
# 2 B 2 (64 - 62)
# 3 C 1 (1 - 2)
The equation inside the brackets in desired result
is just to show where the diff
came from.
Any hint on how can I do that?
CodePudding user response:
Your sample data in data.frame
does not match your console output, so results will be different.
Two methods, depending on a few factors.
Assuming that order is externally controlled,
test1 %>% group_by(my_groups) %>% slice(c(1, n())) %>% summarize(diff = diff(measure)) # # A tibble: 3 x 2 # my_groups diff # <chr> <dbl> # 1 A 95 # 2 B -2 # 3 C 2
or just
test1 %>% group_by(my_groups) %>% summarize(diff = measure[n()] - measure[1])
The advantage of this is that it counters an issue with approach 2 below (ties in
which.max
): if you control the ordering yourself, you are guaranteed to use the first/last values you need.NOTE that for this portion, I assume that the order of data you gave us in your sample data is relevant. I'm assuming that there is some way to guarantee that your results are found. With your latest comment, we can
arrange
before the summarization and get closer to your desired results withtest1 %>% arrange(time, -measure) %>% # this is the "external" sorting I mentioned, so we don't need which.min/.max group_by(my_groups) %>% summarize(diff = measure[n()] - measure[1]) # # A tibble: 3 x 2 # my_groups diff # <chr> <dbl> # 1 A 100 # 2 B -2 # 3 C 2
Without pre-sorting, we can use
which.min
andwhich.max
. The problem with this is that when ties occur, it may not choose the one that you want.test1 %>% group_by(my_groups) %>% summarize(diff = measure[which.max(time)] - measure[which.min(time)]) # # A tibble: 3 x 2 # my_groups diff # <chr> <dbl> # 1 A 100 # 2 B -2 # 3 C 3
CodePudding user response:
test1 %>%
dplyr::group_by(my_groups) %>%
dplyr::mutate(
first = min(time), last = max(time),
) %>%
dplyr::select(-time, -measure) %>%
dplyr::distinct() %>%
dplyr::mutate(diff = first - last) %>%
dplyr::select(-first, -last)