For example, these are the days certain type of roles are present in an office
type | day_in | day_out |
---|---|---|
A | 1 | 10 |
A | 5 | 15 |
A | 31 | 35 |
B | 5 | 15 |
C | 10 | 20 |
C | 45 | 55 |
D | 41 | 50 |
I want the number of days the office is occupied. There is a continuous office presence from days 1 to 20, 31 to 35, and 41 to 45, so the answer I want is 40 days.
I have a solution based on pivoting the data and setting flags on day when the state switches between occupied and unoccupied , using a for
loop to cycle through each row. But I came to this solution reluctantly after failing to work out a vectorized approach.
Is there a vectorized way to do the operation from my for
loop? Or any ideas for different algorithms would also be welcome.
My solution with example data is below:
library(dplyr)
library(tidyr)
df_raw <- read.table(
header = TRUE,
text = "
type day_in day_out
A 1 10
A 5 15
A 31 35
B 5 15
C 10 20
C 45 55
D 41 50
"
)
# occupancy from day 1 to 20, 31 to 35 & 41 to 55 = 40 days
# Unoccupied for 15 days
df <- df_raw %>%
tidyr::pivot_longer(cols = c(day_in, day_out), names_to = "in_out", values_to = "day") %>%
arrange(day)
# Create these columns to prevent warning "Unknown or uninitialised column" later
df$current_types <- NA
df$flag <- NA
# Loop to create flags on day when occupancy switches from occupied to unoccupied or vice-versa
for (rown in 1:nrow(df)) {
df$current_types[rown] <- if (rown == 1) {
df$type[rown]
} else {
if (df$in_out[rown] == "day_in") {
paste(df$current_types[rown - 1], df$type[rown], collapse = " ")
} else {
trimws(gsub(paste0("\\s?", df$type[rown], "\\s?"), " ", df$current_types[rown - 1]))
}
}
# if there are no current type then unoccupied. It may or may not be occupied again afterwards.
df$flag[rown] <- if (rown == 1 | (df$in_out[rown] == "day_out" & nchar(df$current_types[rown]) == 0)) {
1
} else {
if (df$in_out[rown] == "day_in" & nchar(df$current_types[rown - 1]) == 0) 1 else 0
}
}
# Then filter the flags, "pivot" to get each occupancy start and end in one row and sum the total days occupied
df %>%
filter(flag == 1) %>%
mutate(
start = if_else(in_out == "day_out" & lag(in_out) == "day_in", dplyr::lag(day), NULL),
stop = if_else(in_out == "day_out", day, NULL)
) %>%
filter(in_out == "day_out") %>%
summarise(days_occupied = sum(stop - start 1))
CodePudding user response:
You can generate day sequences for each role and count the number of unique days:
length(unique(unlist(apply(df_raw[, c('day_in', 'day_out')],
1,
function(x) seq(x[1], x[2])))))
Or using pipes:
df_raw[, c('day_in', 'day_out')] %>%
apply(1, function(x) seq(x[1], x[2])) %>%
unlist %>%
unique %>%
length
CodePudding user response:
Another simple solution would be to create a vector with the size of your timespan and flag all occupied days and count them afterwards.
df <- data.frame(
type = c("A","A","A","B","C","C","D"),
day_in = c(1,5,31,5,10,45,41),
day_out = c(10,15,35,15,20,55,50))
occupation <- rep(0, max(df$day_out))
for(i in 1:nrow(df)){
occupation[df[i,'day_in']:df[i,'day_out']] <- 1
}
# 40
sum(occupation)