Efficient recursive random sampling-CodePudding

Imagine a df in the following format:

The problem is to randomly select one row for the first unique value in ID1, remove the corresponding ID2 value from the dataset, randomly select a value from the remaining pool of ID2 values for the second ID1 value (i.e. recursively), and so on.

So, for example, for the first ID1 value, it would do sample(1:5, 1), with the result 2. For the second ID1 value, it would do sample(c(1, 3:5), 1), with the result 3. For the third ID1 value, it would do sample(c(1, 4:5), 1), with the result 5. It cannot happen that there are no unique ID2 values left to assign to a particular ID1. In the end, the results should have a similar format:

It should be efficient enough to handle reasonably large datasets (tens of thousands unique values in ID1 and hundreds of thousands unique values per ID2).

I tried multiple ways to solve this problem, but honestly none of them are meaningful and would likely only contribute to confusion, so I'm not sharing them here.

Sample data:

df <- data.frame(ID1 = rep(LETTERS[1:3], each = 5),
                 ID2 = rep(1:5, 3))

CodePudding user response：

I think this algorithm does what you want, but it's not very efficient. It may provide others with a starting point for faster solutions.

all_ID1 <- unique(df$ID1)
available <- unique(df$ID2)
new_ID2 <-  numeric(length(all_ID1))

for(i in seq_along(all_ID1))
{
  ID2_group <- df$ID2[df$ID1 == all_ID1[i]]
  sample_space <- ID2_group[ID2_group %in% available]
  new_ID2[i]<- sample(sample_space, 1)
  available <- available[available != new_ID2[i]]
}

data.frame(ID1 = all_ID1, ID2 = new_ID2)
#>   ID1 ID2
#> 1   A   5
#> 2   B   1
#> 3   C   2

Note that this will not work if you run out of unique ID2 values. For example, if you had letters A:F in the ID1 column, each with ID2 values of 1:5, then by the time you get to selecting an ID2 value for the ID1 value "F", there are no unique ID2 values left, since numbers 1 to 5 have all been assigned to letters A:E. You don't state in your question what should happen when there are no unique ID2 values left to assign to a particular ID1 - should they be NA, or are repeats allowed at that point?

^{Created on 2021-11-03 by the reprex package (v2.0.0)}

CodePudding user response：

selected <- c()

for(i in unique(df[,1])) {

    x <- df[df[,"ID1"]==i,"ID2"]

    y <- setdiff(x,selected)
    selected <- unique(c(sample(y,1),selected))
    

}

data.frame(ID1 = unique(df[,1]), ID2 =selected)

gives,

CodePudding user response：

A possible approach

library(data.table)
setDT(df)
exclude.values <- as.numeric()
L <- split(df, by = "ID1")
ans <- lapply(L, function(x) {
  sample.value <- sample(setdiff(x$ID2, exclude.values), 1)
  exclude.values <<- c(exclude.values, sample.value)
  return(sample.value)
})

CodePudding user response：

You can try the code below (Reduce is applied to recursively adding unvisited ID2 values)

lst <- split(df, ~ID1)
Reduce(
  function(x, y) {
    y <- subset(y,!ID2 %in% x$ID2)
    rbind(x, y[sample(nrow(y), 1), ])
  },
  lst[-1],
  init = lst[[1]][sample(1:nrow(lst[[1]]), 1), ],
)

which gives