Home > OS >  Combining observations from same day while keeping earliest start and latest end time
Combining observations from same day while keeping earliest start and latest end time

Time:09-07

I am working with a dataset containing observations of animals during a given time period as well as data on how much they ate during that time period and some metadata about the animal their ID number.

Here is a subset of the dataset:

                  start                 end id uptake
 1: 2017-01-29 10:16:00 2017-01-29 12:25:00 14   2.04
 2: 2017-01-29 10:16:00 2017-01-29 12:25:00 21   1.53
 3: 2017-01-29 10:16:00 2017-01-29 12:25:00 12   0.17
 4: 2017-01-29 10:16:00 2017-01-29 12:25:00 20   1.19
 5: 2017-01-29 10:16:00 2017-01-29 12:25:00 19   0.85
 6: 2017-01-31 09:48:00 2017-01-31 11:59:00 21   5.27
 7: 2017-01-31 09:48:00 2017-01-31 11:59:00 34   1.87
 8: 2017-01-31 11:50:00 2017-01-31 14:59:00 21   1.00
 9: 2017-01-31 11:50:00 2017-01-31 14:59:00 34   0.14
10: 2017-01-31 11:50:00 2017-01-31 14:59:00 20   1.00
11: 2017-01-31 11:50:00 2017-01-31 14:59:00 19   0.43
12: 2017-01-31 11:50:00 2017-01-31 14:59:00 14   3.43
13: 2017-01-31 15:15:00 2017-01-31 16:21:00 12   1.00
14: 2017-01-31 15:15:00 2017-01-31 16:21:00 20   0.72
15: 2017-01-31 15:15:00 2017-01-31 16:21:00 14   0.86
16: 2017-01-31 15:15:00 2017-01-31 16:21:00 21   0.43
17: 2017-01-31 15:15:00 2017-01-31 16:21:00 19   0.57
18: 2017-02-01 09:55:00 2017-02-01 11:47:00 34   1.62
19: 2017-02-01 09:55:00 2017-02-01 11:47:00 21   3.06
20: 2017-02-01 12:03:00 2017-02-01 15:02:00 19   1.29
21: 2017-02-01 12:03:00 2017-02-01 15:02:00 14   3.86

Normally there is a maximum of one row per individual per day, as there was only one observation period. However, on some days there were multiple observation periods so that some ids have more than one row on these days. For these days I would like to collapse the multiple rows per individual while keeping the earliest start and latest end timestamp of the observation periods of that day, while summing up the uptake value but keeping the id value the same.

I am looking for something like this:

    X            start              end id uptake
1   1 2017-01-29 10:16 2017-01-29 12:25 14   2.04
2   2 2017-01-29 10:16 2017-01-29 12:25 21   1.53
3   3 2017-01-29 10:16 2017-01-29 12:25 12   0.17
4   4 2017-01-29 10:16 2017-01-29 12:25 20   1.19
5   5 2017-01-29 10:16 2017-01-29 12:25 19   0.85
6   6 2017-01-31 09:48 2017-01-31 16:21 21   6.70
7   7 2017-01-31 09:48 2017-01-31 16:21 34   2.01
8  10 2017-01-31 11:50 2017-01-31 16:21 20   1.72
9  11 2017-01-31 11:50 2017-01-31 16:21 19   1.00
10 12 2017-01-31 11:50 2017-01-31 16:21 14   4.29
11 13 2017-01-31 15:15 2017-01-31 16:21 12   1.00
12 18 2017-02-01 09:55 2017-02-01 15:02 34   1.62
13 19 2017-02-01 09:55 2017-02-01 15:02 21   3.06
14 20 2017-02-01 12:03 2017-02-01 15:02 19   1.29
15 21 2017-02-01 12:03 2017-02-01 15:02 14   3.86

CodePudding user response:

Within dplyr this is a task for group_by and summarize:

library(dplyr)
library(lubridate)

df |> 
  group_by(id, lubridate::date(start)) |>
  summarise(start = min(start),
            end = max(end),
            uptake = sum(uptake)) |>
  ungroup() |>
  arrange(start)

Output:

# A tibble: 15 × 5
      id date       start               end                 uptake
   <dbl> <date>     <dttm>              <dttm>               <dbl>
 1    12 2017-01-29 2017-01-29 10:16:00 2017-01-29 12:25:00   0.17
 2    14 2017-01-29 2017-01-29 10:16:00 2017-01-29 12:25:00   2.04
 3    19 2017-01-29 2017-01-29 10:16:00 2017-01-29 12:25:00   0.85
 4    20 2017-01-29 2017-01-29 10:16:00 2017-01-29 12:25:00   1.19
 5    21 2017-01-29 2017-01-29 10:16:00 2017-01-29 12:25:00   1.53
 6    21 2017-01-31 2017-01-31 09:48:00 2017-01-31 16:21:00   6.7 
 7    34 2017-01-31 2017-01-31 09:48:00 2017-01-31 14:59:00   2.01
 8    14 2017-01-31 2017-01-31 11:50:00 2017-01-31 16:21:00   4.29
 9    19 2017-01-31 2017-01-31 11:50:00 2017-01-31 16:21:00   1   
10    20 2017-01-31 2017-01-31 11:50:00 2017-01-31 16:21:00   1.72
11    12 2017-01-31 2017-01-31 15:15:00 2017-01-31 16:21:00   1   
12    21 2017-02-01 2017-02-01 09:55:00 2017-02-01 11:47:00   3.06
13    34 2017-02-01 2017-02-01 09:55:00 2017-02-01 11:47:00   1.62
14    14 2017-02-01 2017-02-01 12:03:00 2017-02-01 15:02:00   3.86
15    19 2017-02-01 2017-02-01 12:03:00 2017-02-01 15:02:00   1.29 

Data (please include data using dput next time):

library(readr)
library(dplyr)

df <- read_delim("idx,start,end,id,uptake
1:,2017-01-29 10:16:00,2017-01-29 12:25:00,14,2.04
2:,2017-01-29 10:16:00,2017-01-29 12:25:00,21,1.53
3:,2017-01-29 10:16:00,2017-01-29 12:25:00,12,0.17
4:,2017-01-29 10:16:00,2017-01-29 12:25:00,20,1.19
5:,2017-01-29 10:16:00,2017-01-29 12:25:00,19,0.85
6:,2017-01-31 09:48:00,2017-01-31 11:59:00,21,5.27
7:,2017-01-31 09:48:00,2017-01-31 11:59:00,34,1.87
8:,2017-01-31 11:50:00,2017-01-31 14:59:00,21,1.00
9:,2017-01-31 11:50:00,2017-01-31 14:59:00,34,0.14
10:,2017-01-31 11:50:00,2017-01-31 14:59:00,20,1.00
11:,2017-01-31 11:50:00,2017-01-31 14:59:00,19,0.43
12:,2017-01-31 11:50:00,2017-01-31 14:59:00,14,3.43
13:,2017-01-31 15:15:00,2017-01-31 16:21:00,12,1.00
14:,2017-01-31 15:15:00,2017-01-31 16:21:00,20,0.72
15:,2017-01-31 15:15:00,2017-01-31 16:21:00,14,0.86
16:,2017-01-31 15:15:00,2017-01-31 16:21:00,21,0.43
17:,2017-01-31 15:15:00,2017-01-31 16:21:00,19,0.57
18:,2017-02-01 09:55:00,2017-02-01 11:47:00,34,1.62
19:,2017-02-01 09:55:00,2017-02-01 11:47:00,21,3.06
20:,2017-02-01 12:03:00,2017-02-01 15:02:00,19,1.29
21:,2017-02-01 12:03:00,2017-02-01 15:02:00,14,3.86") |> select(-idx)
  • Related