R - Sum total time of multiple overlapping and/or discontinuous periods-CodePudding

For example, these are the days certain type of roles are present in an office

type	day_in	day_out
A	1	10
A	5	15
A	31	35
B	5	15
C	10	20
C	45	55
D	41	50

I want the number of days the office is occupied. There is a continuous office presence from days 1 to 20, 31 to 35, and 41 to 45, so the answer I want is 40 days.

I have a solution based on pivoting the data and setting flags on day when the state switches between occupied and unoccupied , using a for loop to cycle through each row. But I came to this solution reluctantly after failing to work out a vectorized approach.

Is there a vectorized way to do the operation from my for loop? Or any ideas for different algorithms would also be welcome.

My solution with example data is below:

library(dplyr)
library(tidyr)
df_raw <- read.table(
  header = TRUE,
  text = "
type day_in day_out
A  1 10
A  5 15
A 31 35
B  5 15
C 10 20
C 45 55
D 41 50 
"
)
# occupancy from day 1 to 20, 31 to 35 & 41 to 55 = 40 days
# Unoccupied for 15 days

df <- df_raw %>%
  tidyr::pivot_longer(cols = c(day_in, day_out), names_to = "in_out", values_to = "day") %>%
  arrange(day)

# Create these columns to prevent warning "Unknown or uninitialised column" later
df$current_types <- NA
df$flag <- NA

# Loop to create flags on day when occupancy switches from occupied to unoccupied or vice-versa
for (rown in 1:nrow(df)) {
  df$current_types[rown] <- if (rown == 1) {
    df$type[rown]
  } else {
    if (df$in_out[rown] == "day_in") {
      paste(df$current_types[rown - 1], df$type[rown], collapse = " ")
    } else {
      trimws(gsub(paste0("\\s?", df$type[rown], "\\s?"), " ", df$current_types[rown - 1]))
    }
  }
  # if there are no current type then unoccupied. It may or may not be occupied again afterwards.
  df$flag[rown] <- if (rown == 1 | (df$in_out[rown] == "day_out" & nchar(df$current_types[rown]) == 0)) {
    1
  } else {
    if (df$in_out[rown] == "day_in" & nchar(df$current_types[rown - 1]) == 0) 1 else 0
  }
}

# Then filter the flags, "pivot" to get each occupancy start and end in one row and sum the total days occupied

df %>%
  filter(flag == 1) %>%
  mutate(
    start = if_else(in_out == "day_out" & lag(in_out) == "day_in", dplyr::lag(day), NULL),
    stop = if_else(in_out == "day_out", day, NULL)
  ) %>%
  filter(in_out == "day_out") %>%
  summarise(days_occupied = sum(stop - start   1))

CodePudding user response：

You can generate day sequences for each role and count the number of unique days:

length(unique(unlist(apply(df_raw[, c('day_in', 'day_out')], 
                           1, 
                           function(x) seq(x[1], x[2])))))

Or using pipes:

df_raw[, c('day_in', 'day_out')] %>%
  apply(1, function(x) seq(x[1], x[2])) %>%
  unlist %>%
  unique %>%
  length

CodePudding user response：

Another simple solution would be to create a vector with the size of your timespan and flag all occupied days and count them afterwards.

df <- data.frame(
  type = c("A","A","A","B","C","C","D"),
  day_in = c(1,5,31,5,10,45,41),
  day_out = c(10,15,35,15,20,55,50))

occupation <- rep(0, max(df$day_out))

for(i in 1:nrow(df)){
  occupation[df[i,'day_in']:df[i,'day_out']] <- 1
}

# 40
sum(occupation)