I have a data set that contains movements of groups between different locations, which looks like this:
library(data.table)
df = data.table(Date = c('01/01/2021', '06/01/2021', '09/01/2021', '10/01/2021', '20/01/2021', '24/01/2021'),
Group = c('Group A', 'Group A', 'Group B', 'Group B', 'Group B', 'Group A'),
From = c('NA', 1, 3, 4, 5, 4),
To = c(1, 'NA', 4, 5, 'NA', 1))
Date Group From To
01/01/2021 Group A NA 1
06/01/2021 Group A 1 NA
09/01/2021 Group B 3 4
10/01/2021 Group B 4 5
20/01/2021 Group B 5 NA
24/01/2021 Group A 4 1
Now, I would like to create a data table for each group/location combination the time range that was spent at the specific location. In the case the "From" column contains NA, this group just entered the movement process. When the "To" column contains NA, this group exited the movement process. The desired data would then look like:
Group Date_entry Date_exit Location
Group A 01/01/2021 06/01/2021 1
Group B 09/01/2021 10/01/2021 4
Group B 10/01/2021 20/01/2021 5
Group A 24/01/2021 NA 1
CodePudding user response:
I recommend an approach like the following:
df %>%
inner_join(
df,
by = c("Group" = "Group", "To" = "From"),
suffix = c("_F", "_T")
) %>%
select(Group, Date_F, Date_T, Location = To) %>%
filter(Location != "NA") %>%
mutate(Date_T = ifelse(Date_F < Date_T, Date_T, NA)) %>%
group_by(Group, Location, Date_F) %>%
summarise(Date_T = min(Date_T), .groups = "drop")
This may be a more general approach than you need - it is designed to handle many different rows per Group-Location pair.
How it works:
- Join
df
to itself, so we have records for all date pairs where a group enters and leaves a location. Note that these dates may be out of order, and dates may be repeated. - Discard non-locations
- Replace end-dates that are before start-dates with NA
- Find the minimum remaining end-date (that is after the start-date)