Home > Software engineering >  Using dplyr, group data by id and date, find first and last location
Using dplyr, group data by id and date, find first and last location

Time:02-03

EDIT: Sorry, it seems my example data was too simple/nice. The full data set is much larger. I cannot recover the order of events by ordering by date or anything else. And the on and off are ids, not event numbers, so do not have an order either. I've updated the example to better reflect this.

Here is some example data:

ids <- c(1, 1, 1, 2, 2, 2)
date <- c(1,1,1, 3,3,3)
off <- c(234,234,93, 675,876,876) # these are ids
on <- c(93,111,234, 876,675,675) # these are ids

df <- data.frame(ids, dates, on, off)

This represents journeys, ie individual 1 goes from 234 -> 93 -> 234 -> 111 individual 2 goes from 876 -> 675 -> 876 -> 675

The date information is not detailed enough to order the records on their own. I cannot just take first and last.

Grouping the data by id and date, and I want identify where the first off location was, and the last on location was, and aggregate this into one record.

I would expect an outcome in this instance of

ids <- c(1, 2)
date <- c(1,3)
off <- c(234, 111)
on <- c(876, 675)

I have tried many things but none have worked correctly.

CodePudding user response:

It looks like your logic is that for each id, you just want the minimum value for off and the maximum value for on, so this should do it.

library(dplyr, warn.conflicts = FALSE)
ids <- c(1, 1, 1, 2, 2, 2)
date <- c(1,1,1, 3,3,3)
off <- c(111,234,111, 675,876,675)
on <- c(234,111,876, 876,675,876)

df <- data.frame(ids = ids, date = date, on = on, off = off)
df %>%
  group_by(ids) %>%
  filter(on == max(on), off == min(off)) %>%
  distinct()
#> # A tibble: 2 × 4
#> # Groups:   ids [2]
#>     ids  date    on   off
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1     1   876   111
#> 2     2     3   876   675

Created on 2023-02-02 by the reprex package (v2.0.1)

CodePudding user response:

you may also use group_by and slice_head

data.frame(ids, date, on, off) %>% arrange(ids,date, on, off) %>% group_by(ids) %>% 
slice_head(n=1)

Created on 2023-02-02 with reprex v2.0.2

# A tibble: 2 × 4
# Groups:   ids [2]
    ids  date    on   off
  <dbl> <dbl> <dbl> <dbl>
1     1     1   111   234
2     2     3   675   876

CodePudding user response:

The first and last line of each group determines the beginning and the end. Therefore, select them and summarize the data. For example:

library(dplyr)
library(tidyr)

df %>% 
  group_by(ids, date) %>% 
  mutate(start = case_when(row_number() == 1 ~ off),
         end = case_when(row_number() == n() ~ on)) %>% 
  select(-on, -off) %>% 
  filter(!(is.na(start) & is.na(end))) %>% 
  fill(start, .direction="down") %>% 
  fill(end, .direction="up") %>% 
  distinct()

This ends up with:

# A tibble: 2 × 4
# Groups:   ids, date [2]
    ids  date start   end
  <dbl> <dbl> <dbl> <dbl>
1     1     1   111   876
2     2     3   675   876
  • Related