Imagine a df in the following format:
ID1 ID2
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 B 1
7 B 2
8 B 3
9 B 4
10 B 5
11 C 1
12 C 2
13 C 3
14 C 4
15 C 5
The problem is to randomly select one row for the first unique value in ID1, remove the corresponding ID2 value from the dataset, randomly select a value from the remaining pool of ID2 values for the second ID1 value (i.e. recursively), and so on.
So, for example, for the first ID1 value, it would do sample(1:5, 1)
, with the result 2. For the second ID1 value, it would do sample(c(1, 3:5), 1)
, with the result 3. For the third ID1 value, it would do sample(c(1, 4:5), 1)
, with the result 5. It cannot happen that there are no unique ID2 values left to assign to a particular ID1. In the end, the results should have a similar format:
ID1 ID2
1 A 2
2 B 3
3 C 5
It should be efficient enough to handle reasonably large datasets (tens of thousands unique values in ID1 and hundreds of thousands unique values per ID2).
I tried multiple ways to solve this problem, but honestly none of them are meaningful and would likely only contribute to confusion, so I'm not sharing them here.
Sample data:
df <- data.frame(ID1 = rep(LETTERS[1:3], each = 5),
ID2 = rep(1:5, 3))
CodePudding user response:
I think this algorithm does what you want, but it's not very efficient. It may provide others with a starting point for faster solutions.
all_ID1 <- unique(df$ID1)
available <- unique(df$ID2)
new_ID2 <- numeric(length(all_ID1))
for(i in seq_along(all_ID1))
{
ID2_group <- df$ID2[df$ID1 == all_ID1[i]]
sample_space <- ID2_group[ID2_group %in% available]
new_ID2[i]<- sample(sample_space, 1)
available <- available[available != new_ID2[i]]
}
data.frame(ID1 = all_ID1, ID2 = new_ID2)
#> ID1 ID2
#> 1 A 5
#> 2 B 1
#> 3 C 2
Note that this will not work if you run out of unique ID2 values. For example, if you had letters A:F in the ID1 column, each with ID2 values of 1:5, then by the time you get to selecting an ID2 value for the ID1 value "F", there are no unique ID2 values left, since numbers 1 to 5 have all been assigned to letters A:E. You don't state in your question what should happen when there are no unique ID2 values left to assign to a particular ID1 - should they be NA
, or are repeats allowed at that point?
Created on 2021-11-03 by the reprex package (v2.0.0)
CodePudding user response:
selected <- c()
for(i in unique(df[,1])) {
x <- df[df[,"ID1"]==i,"ID2"]
y <- setdiff(x,selected)
selected <- unique(c(sample(y,1),selected))
}
data.frame(ID1 = unique(df[,1]), ID2 =selected)
gives,
ID1 ID2
1 A 4
2 B 2
3 C 3
CodePudding user response:
A possible approach
library(data.table)
setDT(df)
exclude.values <- as.numeric()
L <- split(df, by = "ID1")
ans <- lapply(L, function(x) {
sample.value <- sample(setdiff(x$ID2, exclude.values), 1)
exclude.values <<- c(exclude.values, sample.value)
return(sample.value)
})
CodePudding user response:
You can try the code below (Reduce
is applied to recursively adding unvisited ID2
values)
lst <- split(df, ~ID1)
Reduce(
function(x, y) {
y <- subset(y,!ID2 %in% x$ID2)
rbind(x, y[sample(nrow(y), 1), ])
},
lst[-1],
init = lst[[1]][sample(1:nrow(lst[[1]]), 1), ],
)
which gives
ID1 ID2
4 A 4
7 B 2
11 C 1