Creating group ids by comparing values of two variables across rows in R

I have a dataframe with two variables (start, end). I would like to create an identifier variable that increases in ascending order of start and, most importantly, stays constant when the value of start in one row coincides with the value of end in any other row of the dataframe.

Below is a simple example of the data:

toy_data <- data.frame(start = c(1,5,6,10,16),
                      end = c(10,9,11,15,17))

The output I would be looking for is the following:

output_data <- data.frame(start = c(1,10,5,6,16),
                   end = c(10,15,9,11,17),
                   NEW_VAR = c(1,1,2,3,4))

CodePudding user response：

The following function should give you the desired identifier variable NEW_VAR.

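identifier <- \(df) {
  x <- array(0L, dim = nrow(df))
  count <- 0L  # offset that shrinks by one for every row that continues a group
  my_seq <- seq_len(nrow(df))
  for (i in my_seq) {
    if(!df[i,]$start %in% df$end) {
      # start does not occur in any end: move on to the next id
      x[i] <- my_seq[i] + count
    } else {
      # start occurs in some end: keep the current id
      x[i] <- my_seq[i] - 1L + count
      count <- count - 1L
    }
  }
  x
}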

Example

toy_data <- data.frame(start = c(1, 2, 2, 4, 16, 21, 18, 3),
                       end = c(16, 2, 21, 2, 2, 2, 3, 1))
toy_data$NEW_VAR <- identifier(toy_data)
# ---------------------
> toy_data$NEW_VAR
[1] 0 0 0 1 1 1 2 2
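
The loop above appears to compute a running count of the rows whose start does not occur anywhere in end. Under that reading, a minimal vectorized sketch is shown below. The name identifier_vec is hypothetical and not part of the original answer.

```r
# Hypothetical helper (not from the original answer): vectorized running count.
identifier_vec <- function(df) {
  # TRUE where the row's start does not appear anywhere in the end column
  starts_new_group <- !(df$start %in% df$end)
  # Running count of such rows; rows whose start matches some end keep the current value
  cumsum(starts_new_group)
}

toy_data <- data.frame(start = c(1, 2, 2, 4, 16, 21, 18, 3),
                       end = c(16, 2, 21, 2, 2, 2, 3, 1))
identifier_vec(toy_data)
# [1] 0 0 0 1 1 1 2 2
```

Like the loop, this only checks whether start occurs somewhere in end. It does not look up which row the matching end belongs to, so rows are numbered in their current order rather than being attached to the group of the matching row.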

CodePudding user response：

You could try adapting this answer to group by ranges that are adjacent to each other. Credit goes entirely to @r2evans.

In this case, you would use expand.grid to get combinations of start and end. Instead of labels you would have row numbers rn to reference.

In the end, you can number the groups based on which rows appear together in the list. The last few lines, starting with enframe, use tibble and tidyr from the tidyverse. To match the group numbers in the desired output, I also re-sorted the results.

I hope this might be helpful.

library(tidyverse)

toy_data <- data.frame(start = c(1,5,6,10,16),
                       end = c(10,9,11,15,17))

toy_data$rn <- seq_len(nrow(toy_data))

eg <- expand.grid(a = seq_len(nrow(toy_data)), b = seq_len(nrow(toy_data)))
eg <- eg[eg$a < eg$b,]

together <- cbind(
  setNames(toy_data[eg$a,], paste0(names(toy_data), "1")),
  setNames(toy_data[eg$b,], paste0(names(toy_data), "2"))
)

together <- subset(together, end1 == start2)

groups <- split(together$rn2, together$rn1)

for (i in toy_data$rn) {
  ind <- (i == names(groups)) | sapply(groups, `%in%`, x = i)
  vals <- groups[ind]
  groups <- c(
    setNames(list(unique(c(i, names(vals), unlist(vals)))), i),
    groups[!ind]
  )
}

min_row <- as.numeric(sapply(groups, min))
ctr <- seq_along(groups)

# Visit the groups in order of their smallest row number, so that
# group ids follow the original row order
lapply(ctr[order(min_row)], \(x) toy_data[toy_data$rn %in% groups[[x]], ]) %>%
  enframe() %>%
  unnest(cols = value) %>%
  select(-rn)

Output

   name start   end
  <int> <dbl> <dbl>
1     1     1    10
2     1    10    15
3     2     5     9
4     3     6    11
5     4    16    17
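
As a base R alternative to building all row pairs, the chaining rule from the question (a row joins the group of the row whose end equals its start) can be sketched with match(). The name chain_groups is hypothetical. The sketch assumes that each start matches at most one end (match() only returns the first hit) and that the links between rows form no cycles.

```r
# Hypothetical helper (not from the original answers): group rows by start/end chains.
chain_groups <- function(df) {
  n <- nrow(df)
  # For each row, the index of a row whose end equals this row's start (NA if none)
  parent <- match(df$start, df$end)
  # Ignore rows that only match themselves (start == end in the same row)
  parent[which(parent == seq_len(n))] <- NA
  # Follow the links back to the first row of each chain
  root <- seq_len(n)
  for (k in seq_len(n)) {
    has_parent <- !is.na(parent[root])
    if (!any(has_parent)) break
    root[has_parent] <- parent[root[has_parent]]
  }
  # Number the chains in ascending order of the start value of their first row
  roots <- unique(root)
  roots <- roots[order(df$start[roots])]
  match(root, roots)
}

toy_data <- data.frame(start = c(1, 5, 6, 10, 16),
                       end = c(10, 9, 11, 15, 17))
toy_data$NEW_VAR <- chain_groups(toy_data)
toy_data$NEW_VAR
# [1] 1 2 3 1 4
```

The ids are returned in the original row order, so the row with start 10 shares id 1 with the row whose end is 10. Sorting by NEW_VAR and then start gives the layout shown in the question.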