Home > Net >  Sum duplicate rows that are grouped and combine their IDs in R
Sum duplicate rows that are grouped and combine their IDs in R

Time:07-25

I have seal data that is grouped by columns: ID, dive number, and dive phase. I summarize this data over 20 second intervals from my original dataframe. When I do this, it creates duplicate 20 second time intervals if two different dive phase types overlap in a 20 second window. I'd like to sum the values of beats_20max in those duplicate rows then have their dive phase be combination of the two (ex: A, B would become AB). Hopefully this will make sense after looking at my data below.

seal_ID diveNum dive_phase datetime            HR_mean HR_max beats20_mean beats20_max
   <chr>     <dbl> <chr>      <dttm>                <dbl>  <dbl>        <dbl>       <int>
 8 Baikal       19 D          2019-04-02 14:43:00    38.6   44.8          6.5          12
 9 Baikal       19 D          2019-04-02 14:43:20    42.2   48            7.5          14
10 Baikal       19 D          2019-04-02 14:43:40    44.0   54.1          8            15
11 Baikal       19 D          2019-04-02 14:44:00    45.5   61.9          8            15
12 Baikal       19 D          2019-04-02 14:44:20    42.1   49.2          7.5          14
13 Baikal       19 D          2019-04-02 14:44:40    39.9   44.1          7            13
14 Baikal       19 D          2019-04-02 14:45:00    45.5   54.5          8            15
15 Baikal       19 D          2019-04-02 14:45:20    44.6   53.1          8            15
16 Baikal       19 D          2019-04-02 14:45:40    45.9   51.7          8            15
17 Baikal       19 B          2019-04-02 14:46:00    46.1   51.7          7.5          14
18 Baikal       19 D          2019-04-02 14:46:00    55.8   59.4          1.5           2
19 Baikal       19 B          2019-04-02 14:46:20    47.4   57.1          8            15
20 Baikal       19 B          2019-04-02 14:46:40    45.4   53.6          8            15

As you can see, lines 17 and 18 are duplicate times but different dive phases, I'd like to sum the beats20_max column and make their dive phase "DB". There are multiple instances of this throughout the dataframe, so if there's a way I can just aggregate or use dplyr to fix this that would be very helpful. I should mention that when I do this aggregation or summarization, I'll need to make sure I still group by seal_ID and diveNum because some seal's datetimes are the same. Thanks for any advice!

Ideal outcome:

seal_ID diveNum dive_phase datetime            beats20_max
   <chr>     <dbl> <chr>      <dttm>                <int>
 8 Baikal       19 D          2019-04-02 14:43:00    12
 9 Baikal       19 D          2019-04-02 14:43:20    14
10 Baikal       19 D          2019-04-02 14:43:40    15
11 Baikal       19 D          2019-04-02 14:44:00    15
12 Baikal       19 D          2019-04-02 14:44:20    14
13 Baikal       19 D          2019-04-02 14:44:40    13
14 Baikal       19 D          2019-04-02 14:45:00    15
15 Baikal       19 D          2019-04-02 14:45:20    15
16 Baikal       19 D          2019-04-02 14:45:40    15
17 Baikal       19 DB         2019-04-02 14:46:00    16  
18 Baikal       19 B          2019-04-02 14:46:20    15
19 Baikal       19 B          2019-04-02 14:46:40    15

CodePudding user response:

We could use group_by with summarise to paste (str_c) the 'dive_phase' and sum the 'beats20_max'

library(dplyr)
library(stringr)
df1 %>%
    group_by(seal_ID, diveNum, datetime) %>%
    summarise(dive_phase = str_c(dive_phase, collapse = ""), 
         beats20_max = sum(beats20_max, na.rm = TRUE), .groups = 'drop') %>%
    select(any_of(names(df1)))

-output

# A tibble: 12 × 5
   seal_ID diveNum dive_phase datetime            beats20_max
   <chr>     <int> <chr>      <chr>                     <int>
 1 Baikal       19 D          2019-04-02 14:43:00          12
 2 Baikal       19 D          2019-04-02 14:43:20          14
 3 Baikal       19 D          2019-04-02 14:43:40          15
 4 Baikal       19 D          2019-04-02 14:44:00          15
 5 Baikal       19 D          2019-04-02 14:44:20          14
 6 Baikal       19 D          2019-04-02 14:44:40          13
 7 Baikal       19 D          2019-04-02 14:45:00          15
 8 Baikal       19 D          2019-04-02 14:45:20          15
 9 Baikal       19 D          2019-04-02 14:45:40          15
10 Baikal       19 BD         2019-04-02 14:46:00          16
11 Baikal       19 B          2019-04-02 14:46:20          15
12 Baikal       19 B          2019-04-02 14:46:40          15

data

df1 <- structure(list(seal_ID = c("Baikal", "Baikal", "Baikal", "Baikal", 
"Baikal", "Baikal", "Baikal", "Baikal", "Baikal", "Baikal", "Baikal", 
"Baikal", "Baikal"), diveNum = c(19L, 19L, 19L, 19L, 19L, 19L, 
19L, 19L, 19L, 19L, 19L, 19L, 19L), dive_phase = c("D", "D", 
"D", "D", "D", "D", "D", "D", "D", "B", "D", "B", "B"), 
atetime = c("2019-04-02 14:43:00", 
"2019-04-02 14:43:20", "2019-04-02 14:43:40", "2019-04-02 14:44:00", 
"2019-04-02 14:44:20", "2019-04-02 14:44:40", "2019-04-02 14:45:00", 
"2019-04-02 14:45:20", "2019-04-02 14:45:40", "2019-04-02 14:46:00", 
"2019-04-02 14:46:00", "2019-04-02 14:46:20", "2019-04-02 14:46:40"
), HR_mean = c(38.6, 42.2, 44, 45.5, 42.1, 39.9, 45.5, 44.6, 
45.9, 46.1, 55.8, 47.4, 45.4), HR_max = c(44.8, 48, 54.1, 61.9, 
49.2, 44.1, 54.5, 53.1, 51.7, 51.7, 59.4, 57.1, 53.6), beats20_mean = c(6.5, 
7.5, 8, 8, 7.5, 7, 8, 8, 8, 7.5, 1.5, 8, 8), beats20_max = c(12L, 
14L, 15L, 15L, 14L, 13L, 15L, 15L, 15L, 14L, 2L, 15L, 15L)),
 class = "data.frame", row.names = c("8", 
"9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", 
"20"))
  • Related