I have seal data that is grouped by columns: ID, dive number, and dive phase. I summarize this data over 20 second intervals from my original dataframe. When I do this, it creates duplicate 20 second time intervals if two different dive phase types overlap in a 20 second window. I'd like to sum the values of beats_20max in those duplicate rows then have their dive phase be combination of the two (ex: A, B would become AB). Hopefully this will make sense after looking at my data below.
seal_ID diveNum dive_phase datetime HR_mean HR_max beats20_mean beats20_max
<chr> <dbl> <chr> <dttm> <dbl> <dbl> <dbl> <int>
8 Baikal 19 D 2019-04-02 14:43:00 38.6 44.8 6.5 12
9 Baikal 19 D 2019-04-02 14:43:20 42.2 48 7.5 14
10 Baikal 19 D 2019-04-02 14:43:40 44.0 54.1 8 15
11 Baikal 19 D 2019-04-02 14:44:00 45.5 61.9 8 15
12 Baikal 19 D 2019-04-02 14:44:20 42.1 49.2 7.5 14
13 Baikal 19 D 2019-04-02 14:44:40 39.9 44.1 7 13
14 Baikal 19 D 2019-04-02 14:45:00 45.5 54.5 8 15
15 Baikal 19 D 2019-04-02 14:45:20 44.6 53.1 8 15
16 Baikal 19 D 2019-04-02 14:45:40 45.9 51.7 8 15
17 Baikal 19 B 2019-04-02 14:46:00 46.1 51.7 7.5 14
18 Baikal 19 D 2019-04-02 14:46:00 55.8 59.4 1.5 2
19 Baikal 19 B 2019-04-02 14:46:20 47.4 57.1 8 15
20 Baikal 19 B 2019-04-02 14:46:40 45.4 53.6 8 15
As you can see, lines 17 and 18 are duplicate times but different dive phases, I'd like to sum the beats20_max column and make their dive phase "DB". There are multiple instances of this throughout the dataframe, so if there's a way I can just aggregate or use dplyr to fix this that would be very helpful. I should mention that when I do this aggregation or summarization, I'll need to make sure I still group by seal_ID and diveNum because some seal's datetimes are the same. Thanks for any advice!
Ideal outcome:
seal_ID diveNum dive_phase datetime beats20_max
<chr> <dbl> <chr> <dttm> <int>
8 Baikal 19 D 2019-04-02 14:43:00 12
9 Baikal 19 D 2019-04-02 14:43:20 14
10 Baikal 19 D 2019-04-02 14:43:40 15
11 Baikal 19 D 2019-04-02 14:44:00 15
12 Baikal 19 D 2019-04-02 14:44:20 14
13 Baikal 19 D 2019-04-02 14:44:40 13
14 Baikal 19 D 2019-04-02 14:45:00 15
15 Baikal 19 D 2019-04-02 14:45:20 15
16 Baikal 19 D 2019-04-02 14:45:40 15
17 Baikal 19 DB 2019-04-02 14:46:00 16
18 Baikal 19 B 2019-04-02 14:46:20 15
19 Baikal 19 B 2019-04-02 14:46:40 15
CodePudding user response:
We could use group_by
with summarise
to paste
(str_c
) the 'dive_phase' and sum
the 'beats20_max'
library(dplyr)
library(stringr)
df1 %>%
group_by(seal_ID, diveNum, datetime) %>%
summarise(dive_phase = str_c(dive_phase, collapse = ""),
beats20_max = sum(beats20_max, na.rm = TRUE), .groups = 'drop') %>%
select(any_of(names(df1)))
-output
# A tibble: 12 × 5
seal_ID diveNum dive_phase datetime beats20_max
<chr> <int> <chr> <chr> <int>
1 Baikal 19 D 2019-04-02 14:43:00 12
2 Baikal 19 D 2019-04-02 14:43:20 14
3 Baikal 19 D 2019-04-02 14:43:40 15
4 Baikal 19 D 2019-04-02 14:44:00 15
5 Baikal 19 D 2019-04-02 14:44:20 14
6 Baikal 19 D 2019-04-02 14:44:40 13
7 Baikal 19 D 2019-04-02 14:45:00 15
8 Baikal 19 D 2019-04-02 14:45:20 15
9 Baikal 19 D 2019-04-02 14:45:40 15
10 Baikal 19 BD 2019-04-02 14:46:00 16
11 Baikal 19 B 2019-04-02 14:46:20 15
12 Baikal 19 B 2019-04-02 14:46:40 15
data
df1 <- structure(list(seal_ID = c("Baikal", "Baikal", "Baikal", "Baikal",
"Baikal", "Baikal", "Baikal", "Baikal", "Baikal", "Baikal", "Baikal",
"Baikal", "Baikal"), diveNum = c(19L, 19L, 19L, 19L, 19L, 19L,
19L, 19L, 19L, 19L, 19L, 19L, 19L), dive_phase = c("D", "D",
"D", "D", "D", "D", "D", "D", "D", "B", "D", "B", "B"),
atetime = c("2019-04-02 14:43:00",
"2019-04-02 14:43:20", "2019-04-02 14:43:40", "2019-04-02 14:44:00",
"2019-04-02 14:44:20", "2019-04-02 14:44:40", "2019-04-02 14:45:00",
"2019-04-02 14:45:20", "2019-04-02 14:45:40", "2019-04-02 14:46:00",
"2019-04-02 14:46:00", "2019-04-02 14:46:20", "2019-04-02 14:46:40"
), HR_mean = c(38.6, 42.2, 44, 45.5, 42.1, 39.9, 45.5, 44.6,
45.9, 46.1, 55.8, 47.4, 45.4), HR_max = c(44.8, 48, 54.1, 61.9,
49.2, 44.1, 54.5, 53.1, 51.7, 51.7, 59.4, 57.1, 53.6), beats20_mean = c(6.5,
7.5, 8, 8, 7.5, 7, 8, 8, 8, 7.5, 1.5, 8, 8), beats20_max = c(12L,
14L, 15L, 15L, 14L, 13L, 15L, 15L, 15L, 14L, 2L, 15L, 15L)),
class = "data.frame", row.names = c("8",
"9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19",
"20"))