EDIT: Sorry, it seems my example data was too simple/nice. The full data set is much larger. I cannot recover the order of events by ordering by date or anything else. And the on and off are ids, not event numbers, so do not have an order either. I've updated the example to better reflect this.
Here is some example data:
ids <- c(1, 1, 1, 2, 2, 2)
date <- c(1,1,1, 3,3,3)
off <- c(234,234,93, 675,876,876) # these are ids
on <- c(93,111,234, 876,675,675) # these are ids
df <- data.frame(ids, dates, on, off)
This represents journeys, ie individual 1 goes from 234 -> 93 -> 234 -> 111 individual 2 goes from 876 -> 675 -> 876 -> 675
The date information is not detailed enough to order the records on their own. I cannot just take first and last.
Grouping the data by id and date, and I want identify where the first off location was, and the last on location was, and aggregate this into one record.
I would expect an outcome in this instance of
ids <- c(1, 2)
date <- c(1,3)
off <- c(234, 111)
on <- c(876, 675)
I have tried many things but none have worked correctly.
CodePudding user response:
It looks like your logic is that for each id, you just want the minimum value for off and the maximum value for on, so this should do it.
library(dplyr, warn.conflicts = FALSE)
ids <- c(1, 1, 1, 2, 2, 2)
date <- c(1,1,1, 3,3,3)
off <- c(111,234,111, 675,876,675)
on <- c(234,111,876, 876,675,876)
df <- data.frame(ids = ids, date = date, on = on, off = off)
df %>%
group_by(ids) %>%
filter(on == max(on), off == min(off)) %>%
distinct()
#> # A tibble: 2 × 4
#> # Groups: ids [2]
#> ids date on off
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 876 111
#> 2 2 3 876 675
Created on 2023-02-02 by the reprex package (v2.0.1)
CodePudding user response:
you may also use group_by
and slice_head
data.frame(ids, date, on, off) %>% arrange(ids,date, on, off) %>% group_by(ids) %>%
slice_head(n=1)
Created on 2023-02-02 with reprex v2.0.2
# A tibble: 2 × 4
# Groups: ids [2]
ids date on off
<dbl> <dbl> <dbl> <dbl>
1 1 1 111 234
2 2 3 675 876
CodePudding user response:
The first and last line of each group determines the beginning and the end. Therefore, select them and summarize the data. For example:
library(dplyr)
library(tidyr)
df %>%
group_by(ids, date) %>%
mutate(start = case_when(row_number() == 1 ~ off),
end = case_when(row_number() == n() ~ on)) %>%
select(-on, -off) %>%
filter(!(is.na(start) & is.na(end))) %>%
fill(start, .direction="down") %>%
fill(end, .direction="up") %>%
distinct()
This ends up with:
# A tibble: 2 × 4
# Groups: ids, date [2]
ids date start end
<dbl> <dbl> <dbl> <dbl>
1 1 1 111 876
2 2 3 675 876