Two random unique samples of the same pool

Time:04-22

I am trying to get two samples with unique elements in each sample. That is, the strings in the "first" vector cannot be in the "second" vector. Unfortunately, I always get repeated strings and I can't seem to find a way of solving this. I tried to solve it using if-else, but with no success.

edit: the final output should be pairs. The same numbers in first should be in second. The only thing that will vary is the letters. Each letter has to appear exactly three times. The reason I don't want repeated elements is that when I create the pairs, I get pairs such as 1_W and 1_W. That cannot happen.

The output should be something like:

first: 12_U, 23_U, 6_U, 8_T, 24_T, 22_T, 7_S, 10_S, 19_S, 21_W, 14_W, 2_W

second: 12_W, 23_W, 6_W, 8_S, 24_S, 22_S, 7_T, 10_T, 19_T, 21_U, 14_U, 2_U

Edit 2:

I did a terrible job at explaining what I need. This code is going to be used to select headlines for a study in which I'm going to collect data.

Each theme represents a headline about a specific topic, such as global warming. There are 24 themes. Each version (U, T, S, W) represents variations of a true headline (T).

I have a headline bank with a total of 96 headlines that vary in terms of themes and versions. 1_U is the U version of theme 1. I want to check which versions participants will choose for each pair.

What I need is

  1. to select 12 themes;
  2. to create pairs within the same theme so participants can choose between two versions of the same headline.
  3. participants always need to see 12 pairs (2 versions of the same theme). I also need to guarantee that they will see equal proportions of each version. That's why I created vector “first” and vector “second”, which meet this criterion.

However, I am getting pairs with repeated versions. For example, some of the pairs I get are 12_S and 12_S, when they should be 12_S and any other version (12_U, 12_T or 12_W), because it does not make sense for a participant to choose between the S version of theme 12 and the S version of theme 12.

By creating two vectors I was able to get exactly what I wanted except for the fact that some pairs contain the same headline.

themes <- c(1:24)
set.seed(1)
twelve <- sample(themes, 12)
versions <- c('U', 'T', 'S', 'W')

set.seed(14) 
first <- sample(paste(sample(twelve), rep(versions, 3), sep='_'))
second <- sample(paste(sample(twelve), rep(versions, 3), sep='_'))

repeated <- first[first %in% second]

# is.null() is FALSE for an empty character vector, so test the length instead
if (length(repeated) == 0) {
  print(second) #if there are no elements in the vector "repeated", then print second
} else {
  x <- sample(paste(sample(twelve), rep(versions, 3), sep='_')) #otherwise, pick another sample
}

CodePudding user response:

To make sure you get two vectors, first and second, where the themes in first do not exist in second, you either need repeated themes within a vector or you must split the themes up by sampling.

set.seed(1)
themes <- 1:24
versions <- c('U', 'T', 'S', 'W')
split_idx <- sample(length(themes), 0.5*length(themes))
set_1 <- themes[split_idx]
set_2 <- themes[-split_idx]

This creates two disjoint samples, which can be verified with

set_1 %in% set_2

This should return a logical vector with only FALSE entries.
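
When a single TRUE/FALSE answer is easier to read than a vector of twelve values, the same check can be collapsed. This is only a sketch that repeats the split from above so it runs on its own; the name overlap is introduced here for illustration.

```r
set.seed(1)
themes <- 1:24

# Split the themes into two halves by sampling indices
split_idx <- sample(length(themes), 0.5 * length(themes))
set_1 <- themes[split_idx]
set_2 <- themes[-split_idx]

# TRUE if at least one theme appears in both halves
overlap <- any(set_1 %in% set_2)
overlap

# The shared themes themselves (empty when the halves are disjoint)
intersect(set_1, set_2)
```

Because set_2 is built from exactly the indices that set_1 leaves out, overlap is FALSE for any seed.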

If you only want 3 letters in the final 2 vectors I suggest the following:

first <- paste(sample(set_1), sample(versions, 3), sep = "_")
second <- paste(sample(set_2), sample(versions, 3), sep = "_")

The use of rep(versions, 3) is unnecessary, as paste() automatically recycles the shorter vector to the length of the longer one.
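
A small illustration of that recycling, using a plain 1:12 sequence in place of the sampled themes (the names with_rep and recycled are only for this example):

```r
versions <- c("U", "T", "S", "W")

# Explicit repetition of the four versions to length 12
with_rep <- paste(1:12, rep(versions, 3), sep = "_")

# Same call without rep(): paste() recycles `versions` three times
recycled <- paste(1:12, versions, sep = "_")

identical(with_rep, recycled)  # TRUE
```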

To get new vectors with changing themes that preserve these properties, you must split themes again into 2 sets.

CodePudding user response:

I think you can make your life easier by sampling the pairs of versions (with no duplicates) and then pasting them onto the theme value. So we first sample 12 themes, then apply a function over them that pastes each theme with a pair of versions. You get a matrix with 2 rows holding your pairs.

set.seed(1)

themes <- 1:24
versions <- c("U", "T", "S", "W")

pairs <- sapply(sample(themes, 12), FUN = function(x) paste(x, sample(versions, 2), sep = "_"))

pairs
#      [,1]  [,2]  [,3]  [,4]  [,5]   [,6]   [,7]   [,8]   [,9]  [,10]  [,11]  [,12]
# [1,] "4_T" "7_S" "1_S" "2_U" "11_U" "14_U" "18_T" "22_T" "5_W" "16_U" "10_T" "6_T"
# [2,] "4_W" "7_U" "1_U" "2_W" "11_T" "14_W" "18_W" "22_U" "5_S" "16_S" "10_W" "6_W"

first <- pairs[1, ]
# [1] "4_T"  "7_S"  "1_S"  "2_U"  "11_U" "14_U" "18_T" "22_T" "5_W"  "16_U" "10_T" "6_T" 

second <- pairs[2, ]
# [1] "4_W"  "7_U"  "1_U"  "2_W"  "11_T" "14_W" "18_W" "22_U" "5_S"  "16_S" "10_W" "6_W"
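
Sampling two versions independently per theme does not by itself make every version appear exactly three times in each vector. If that balance is also needed, one possible sketch is to shuffle a balanced version vector for first and map each version to a fixed, different partner version for second. The names partner, first_v and second_v are introduced here for illustration and are not part of the answer above.

```r
set.seed(1)

themes <- 1:24
versions <- c("U", "T", "S", "W")

# Draw a permutation of the versions in which no version maps to itself
repeat {
  partner <- sample(versions)
  if (all(partner != versions)) break
}
names(partner) <- versions

twelve <- sample(themes, 12)

# Each version exactly three times, in random order
first_v <- sample(rep(versions, 3))

# Look up the partner version for every element of first_v
second_v <- unname(partner[first_v])

first <- paste(twelve, first_v, sep = "_")
second <- paste(twelve, second_v, sep = "_")
```

Because partner is a one-to-one mapping, second_v contains each version exactly three times as well, and since no version is its own partner, a pair can never hold the same headline twice.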

CodePudding user response:

Here is a brute-force approach. I would draw two shuffled samples of the 12 themes the participants choose from, and sample the versions in the same way. This is repeated until there is no duplicate within any row of the resulting matrices. Next, repeat each row of samp_vs three times (each=3) and paste both together using Map. Everything is wrapped in a function, samp_fun.

samp_fun <- \(themes, versions) {
  themes_12 <- sample(themes, 12)
  repeat {
    samp_th <- replicate(2, sample(themes_12))
    samp_vs <- replicate(2, sample(versions))
    if (!any(apply(samp_th, 1, duplicated)) &&
        !any(apply(samp_vs, 1, duplicated))) break
  }
  samp_vs <- samp_vs[rep(seq_len(nrow(samp_vs)), each=3), ]
  Map(\(...) paste(..., sep='_'),
      as.data.frame(samp_th), as.data.frame(samp_vs)) |>
    setNames(c('first', 'second'))
}

Usage

themes <- 1:24
versions <- c('U', 'T', 'S', 'W')

set.seed(42)
res <- samp_fun(themes, versions)

Result

Gives a list with the two groups.

res$first
# [1] "4_S"  "15_S" "9_S"  "18_T" "5_T"  "20_T"
# [7] "17_W" "24_W" "8_W"  "7_U"  "1_U"  "10_U"

res$second
# [1] "15_U" "4_U"  "10_U" "8_W"  "7_W"  "24_W"
# [7] "5_S"  "18_S" "1_S"  "17_T" "9_T"  "20_T"

If you want first and second as objects in the workspace, use list2env.

list2env(res, .GlobalEnv)
first
second
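
To check any candidate pair of vectors against the requirements stated in the question (same theme within a pair, different versions within a pair, each version three times per vector), a small helper can be sketched. check_pairs is a hypothetical name introduced here, and the example vectors are the desired output given in the question.

```r
# Hypothetical helper: tests two vectors of "theme_version" strings
check_pairs <- function(first, second) {
  theme_1 <- sub("_.*$", "", first)   # part before the underscore
  theme_2 <- sub("_.*$", "", second)
  vers_1 <- sub("^.*_", "", first)    # part after the underscore
  vers_2 <- sub("^.*_", "", second)
  c(
    same_theme        = all(theme_1 == theme_2),
    different_version = all(vers_1 != vers_2),
    balanced_first    = all(table(vers_1) == 3),
    balanced_second   = all(table(vers_2) == 3)
  )
}

first <- c("12_U", "23_U", "6_U", "8_T", "24_T", "22_T",
           "7_S", "10_S", "19_S", "21_W", "14_W", "2_W")
second <- c("12_W", "23_W", "6_W", "8_S", "24_S", "22_S",
            "7_T", "10_T", "19_T", "21_U", "14_U", "2_U")

check_pairs(first, second)
```

For the question's example all four checks are TRUE.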

Note: requires R >= 4.1, for the \(x) function shorthand and the native pipe |>.
