add grouping number based on consecutive TRUE/FALSE column values

Time:11-09

Given this data frame:

library(dplyr)
dat <- data.frame(
  bar = c(letters[1:10]),
  foo = c(1,2,3,5,8,9,11,13,14,15)
)
   bar foo
1    a   1
2    b   2
3    c   3
4    d   5
5    e   8
6    f   9
7    g  11
8    h  13
9    i  14
10   j  15

I first want to identify groups, if the foo number is consecutive:

dat <- dat %>% mutate(in_cluster = 
                 ifelse( lead(foo) == foo + 1 | lag(foo) == foo - 1, 
                         TRUE, 
                         FALSE))

Which leads to the following data frame:

   bar foo in_cluster
1    a   1       TRUE
2    b   2       TRUE
3    c   3       TRUE
4    d   5      FALSE
5    e   8       TRUE
6    f   9       TRUE
7    g  11      FALSE
8    h  13       TRUE
9    i  14       TRUE
10   j  15       TRUE

As can be seen, the values 1, 2, 3 form a group, then the value 5 is on its own and does not belong to a cluster, then the values 8, 9 form another cluster, and so on.

I would like to add cluster numbers to these "groups".

Expected output:

   bar foo in_cluster cluster_number
1    a   1       TRUE              1
2    b   2       TRUE              1
3    c   3       TRUE              1
4    d   5      FALSE             NA
5    e   8       TRUE              2
6    f   9       TRUE              2
7    g  11      FALSE             NA
8    h  13       TRUE              3
9    i  14       TRUE              3
10   j  15       TRUE              3

CodePudding user response:

There is probably a neater tidyverse approach for something like this. For example, group_indices could be used if in_cluster were defined through an arbitrary-length case_when. However, we can also write our own helper that deals specifically with runs of logical values, using the rle function.
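To make the rle idea concrete, here is a minimal step-by-step sketch on the in_cluster values from the question. The variable names (runs, run_ids) are only illustrative and are not part of the solutions in this answer.

```r
# in_cluster values copied from the question's data frame
in_cluster <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)

# rle() collapses the vector into runs of equal values
runs <- rle(in_cluster)
runs$lengths  # 3 1 2 1 3
runs$values   # TRUE FALSE TRUE FALSE TRUE

# Number the TRUE runs; FALSE runs get no number
run_ids <- cumsum(runs$values)   # 1 1 2 2 3
run_ids[!runs$values] <- NA      # 1 NA 2 NA 3

# Expand back to one value per row
cluster_number <- rep(run_ids, runs$lengths)
cluster_number  # 1 1 1 NA 2 2 NA 3 3 3
```

Each solution in this answer is a variation on these three steps: encode the runs, number the TRUE runs, then expand back to the original length.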

solution 1 (R version >= 4.1, for the native pipe |> and the \(x) lambda syntax)

lgl_indices <- function(var){
  x <- rle(var)
  # number the TRUE runs, blank out the FALSE runs, then expand to full length
  cumsum(x$values) |> (\(.) { .[which(!x$values)] <- NA; . })() |> rep(x$lengths)
}

solution 2

lgl_indices <- function(var){
  x <- rle(var)
  y <- cumsum(x$values)
  y[which(!x$values)] <- NA
  rep(y, x$lengths)
}

solution 3

lgl_indices <- function(var){
  x <- rle(var)
  l <- vector("list", length(x$values))  # one slot per run
  n <- 1L
  for (i in seq_along(x$values)) {
    if(!x$values[i]) grp <- NA else {
      grp <- n
      n <- n + 1L
    }
    l[[i]] <- rep(grp, x$lengths[i])
  }
  Reduce(c, l)
}

dat %>%
  mutate(cluster_number = lgl_indices(in_cluster))

  bar foo in_cluster cluster_number
1   a   1       TRUE              1
2   b   2       TRUE              1
3   c   3       TRUE              1
4   d   5      FALSE             NA
5   e   8       TRUE              2
6   f   9       TRUE              2
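For comparison, a dplyr-only variant is possible without rle. The following is a sketch of a different technique, not the method used in the solutions of this answer: it flags the first row of each TRUE run and takes a cumulative sum of those flags. The helper column run_start is a name made up for this illustration, and the coalesce() calls are an added assumption to keep the edge rows from becoming NA.

```r
library(dplyr)

# Same data as in the question
dat <- data.frame(
  bar = letters[1:10],
  foo = c(1, 2, 3, 5, 8, 9, 11, 13, 14, 15)
)

res <- dat %>%
  mutate(
    # a row is in a cluster if either neighbour is consecutive with it
    in_cluster = coalesce(lead(foo) == foo + 1, FALSE) |
                 coalesce(lag(foo) == foo - 1, FALSE),
    # TRUE only on the first row of each TRUE run (hypothetical helper column)
    run_start = in_cluster & !lag(in_cluster, default = FALSE),
    # count the run starts seen so far; rows outside a cluster get NA
    cluster_number = if_else(in_cluster, cumsum(run_start), NA_integer_)
  )

res$cluster_number  # 1 1 1 NA 2 2 NA 3 3 3
```

Because the counter only advances at the start of a TRUE run, the numbering stays gap-free regardless of how long the FALSE stretches between clusters are.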

CodePudding user response:

This may not be the most efficient way, but it works:

# Cumulative sum of the negated logical: the counter increases at every FALSE row
dat$new_cluster <- cumsum(!dat$in_cluster) + 1

# Use in_cluster to subset, replacing the cluster number of the FALSE rows with NA
dat$new_cluster[!dat$in_cluster] <- NA
dat
  bar foo in_cluster new_cluster
1   a   1       TRUE           1
2   b   2       TRUE           1
3   c   3       TRUE           1
4   d   5      FALSE          NA
5   e   8       TRUE           2
6   f   9       TRUE           2
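One caveat with counting FALSE rows: it produces consecutive cluster numbers only when every gap between clusters is a single FALSE row and the data starts with TRUE, as in the question's data. The sketch below uses a made-up vector x with two adjacent FALSE values to show the difference, next to a base R variant that counts the starts of TRUE runs instead. The names by_false, run_start and by_start are illustrative.

```r
# Hypothetical input with two FALSE rows in a row
x <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

# Counting FALSE rows: the second FALSE bumps the counter again
by_false <- cumsum(!x) + 1
by_false[!x] <- NA
by_false  # 1 1 NA NA 3  (cluster number 2 is skipped)

# Counting run starts: TRUE rows whose predecessor is not TRUE
run_start <- x & !c(FALSE, head(x, -1))
by_start <- ifelse(x, cumsum(run_start), NA)
by_start  # 1 1 NA NA 2
```

If the cluster numbers are only used as grouping keys, the gaps are harmless; the second form matters only when the numbers themselves need to be consecutive.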