Assign groups in dataframe using vectors containing start and end indexes-CodePudding

I have a simple data frame:

df <- data.frame(X = LETTERS[1:20], 
                 Y = paste0("abc_", 1:20))
#  X     Y
#1 A abc_1
#2 B abc_2
#3 C abc_3
#4 D abc_4
#5 E abc_5
#6 F abc_6
# ...

I need to assign group ID based on two vectors of integers. One indicates where the group starts, the other indicates where the group ends:

start_ix <- c(2, 5, 8, 10, 15, 18)
end_ix   <- c(4, 7, 9, 13, 17, 19)

i.e. the first group is rows 2 through 4, the second is row 5 through 7, and so on. Any row not contained in these indexes (or the span between the start and stop values) should be NA.

The desired outcome would be:

df_want <- structure(list(X = c("A", "B", "C", "D", "E", "F", "G", "H", 
"I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T"), 
    Y = c("abc_1", "abc_2", "abc_3", "abc_4", "abc_5", "abc_6", 
    "abc_7", "abc_8", "abc_9", "abc_10", "abc_11", "abc_12", 
    "abc_13", "abc_14", "abc_15", "abc_16", "abc_17", "abc_18", 
    "abc_19", "abc_20"), grp = c(NA, 1, 1, 1, 2, 2, 2, 3, 3, 
    4, 4, 4, 4, NA, 5, 5, 5, 6, 6, NA)), row.names = c(NA, -20L
), class = "data.frame")

#    X      Y grp
# 1  A  abc_1  NA
# 2  B  abc_2   1
# 3  C  abc_3   1
# 4  D  abc_4   1
# 5  E  abc_5   2
# 6  F  abc_6   2
# 7  G  abc_7   2
# 8  H  abc_8   3
# 9  I  abc_9   3
# 10 J abc_10   4
# 11 K abc_11   4
# 12 L abc_12   4
# 13 M abc_13   4
# 14 N abc_14  NA
# 15 O abc_15   5
# 16 P abc_16   5
# 17 Q abc_17   5
# 18 R abc_18   6
# 19 S abc_19   6
# 20 T abc_20  NA

The solution in my specific case would need to be done in base R, but for the sake of others who may have the same issue feel free to post solutions from external packages.

I have tried a combination of indexing, sorting, and seq but can't seem to come up with a solution.

CodePudding user response：

While there may be better solutions, one potential solution which Vectorizes the seq function to index the rows, then uses a vector from the difference in start and end positions in rep to identify the groups:

#Index
seqV <- Vectorize(seq.default, vectorize.args = c("to", "from"))
ix <- unlist(seqV(start_ix, end_ix))

#Assign groups
df[ix, "grp"] <- rep(1:length(start_ix), (end_ix - start_ix)   1)

# Validate
all.equal(df_want, df)
# [1] TRUE

CodePudding user response：

Using for loop:

for(i in seq_along(start_ix)){
  df[ start_ix[ i ]:end_ix[ i ], "grp"] <- i
  }

Another option, range overlap, using data.table::foverlaps:

library(data.table)

df1 <- cbind(data.table(start = seq(nrow(df)), end = seq(nrow(df))), df)
df2 <- data.table(start = start_ix, end = end_ix, grp = seq_along(start_ix))

setkey(df1, start, end)
setkey(df2, start, end)

foverlaps(df1, df2)[, .(X, Y, group)]
#     X      Y   grp
#  1: A  abc_1    NA
#  2: B  abc_2     1
#  3: C  abc_3     1
#  4: D  abc_4     1
#  5: E  abc_5     2
#  etc...

CodePudding user response：

Fully vectorized base R solution, using indexing, outer, findInterval, and sort:

df$grp <- outer(c(NA, 1), 1:nrow(df))[
  findInterval(
    1:nrow(df),
    sort(
      c(
        start_ix,
        end_ix   0.1
      )
    )
  )   1L
]