I had difficulty finding an answer to this, so I figured I would ask a new question. I am trying to figure out how to take a conditional random sample of a dataset. For simplicity, I have this data frame with one variable, food, which has three levels: apple.1, apple.2, and banana. In a real scenario they would not be so neatly ordered in the data frame and would be more random, but this is what I have so far:
df <- data.frame(food = rep(c("apple.1", "apple.2", "banana"), 500))
head(df,10)
Which gives me this if printed:
food
1 apple.1
2 apple.2
3 banana
4 apple.1
5 apple.2
6 banana
7 apple.1
8 apple.2
9 banana
10 apple.1
Now sampling without replacement from there is easy enough with slice_sample:
library(dplyr)
df %>%
  slice_sample(n = 10)
Which gives me what I need on that front:
food
1 apple.1
2 banana
3 apple.1
4 apple.2
5 apple.2
6 apple.2
7 banana
8 apple.2
9 banana
10 apple.1
However, let's say apple.1 and apple.2 come in pairs from a store, and we only want to pick one apple from each pair. If we pick both apples, the sample becomes less random due to age effects, environmental factors related to packaging, etc. So what I would like to do is take a conditional sample: if I randomly pick fruit from a theoretical fruit basket, I select only bananas and at most one apple from each pair. How can I accomplish this in R?
Edit
I wasn't as specific in my question as I probably should have been. For my specific case, I also need a way to uniquely identify which pair each apple comes from. So if Apple.1 and Apple.2 both come from Basket 67, I would like a way to identify that pair uniquely so I can check for duplicates.
I have included this very simple version of a dataset I'm thinking of:
df <- structure(list(Basket = c(1L, 1L, 2L, 3L, 3L, 4L, 5L, 5L, 6L,
7L, 7L, 8L, 9L, 9L, 10L), Fruit = c("Apple.1", "Apple.2", "Banana",
"Apple.1", "Apple.2", "Banana", "Apple.1", "Apple.2", "Banana",
"Apple.1", "Apple.2", "Banana", "Apple.1", "Apple.1", "Banana"
)), class = "data.frame", row.names = c(NA, -15L))
Which looks like this:
Basket Fruit
1 1 Apple.1
2 1 Apple.2
3 2 Banana
4 3 Apple.1
5 3 Apple.2
6 4 Banana
7 5 Apple.1
8 5 Apple.2
9 6 Banana
10 7 Apple.1
11 7 Apple.2
12 8 Banana
13 9 Apple.1
14 9 Apple.1
15 10 Banana
Answer:
You could first sample the baskets (5 of the 10 here), and then draw one fruit at random from each sampled basket, so that at most one apple per pair is kept.
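library(dplyr)

set.seed(1)  # for reproducibility
df %>%
  # keep only the rows belonging to 5 randomly chosen baskets
  filter(Basket %in% sample(unique(Basket), 5)) %>%
  group_by(Basket) %>%
  # draw one fruit per basket, i.e. one apple from each pair
  slice_sample(n = 1) %>%
  ungroup()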
# # A tibble: 5 × 2
# Basket Fruit
# <int> <chr>
# 1 1 Apple.1
# 2 2 Banana
# 3 4 Banana
# 4 7 Apple.2
# 5 9 Apple.1
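If you would rather not depend on dplyr, the same two-step idea (pick baskets, then pick one row per basket) can be sketched in base R. This is only a sketch under assumptions: `sample_one_per_basket` is a hypothetical helper name, not a function from any package, and it assumes the `Basket` column uniquely identifies each pair, as in the example data from the question.

```r
# Example data from the question; Basket identifies each pair of apples.
df <- data.frame(
  Basket = c(1L, 1L, 2L, 3L, 3L, 4L, 5L, 5L, 6L, 7L, 7L, 8L, 9L, 9L, 10L),
  Fruit = c("Apple.1", "Apple.2", "Banana", "Apple.1", "Apple.2", "Banana",
            "Apple.1", "Apple.2", "Banana", "Apple.1", "Apple.2", "Banana",
            "Apple.1", "Apple.1", "Banana")
)

# Hypothetical helper: choose n_baskets baskets, then one row from each.
sample_one_per_basket <- function(data, n_baskets) {
  # Row indices grouped by basket id
  rows_by_basket <- split(seq_len(nrow(data)), data$Basket)
  # Baskets to keep, drawn without replacement
  chosen <- sample(names(rows_by_basket), n_baskets)
  # Indexing with sample.int() avoids the sample(x, 1) pitfall,
  # where a length-one x is treated as the range 1:x
  picked <- vapply(
    rows_by_basket[chosen],
    function(idx) idx[sample.int(length(idx), 1)],
    integer(1)
  )
  data[sort(unname(picked)), , drop = FALSE]
}

set.seed(1)
result <- sample_one_per_basket(df, 5)
result

# Sanity check: no basket appears more than once in the sample.
stopifnot(anyDuplicated(result$Basket) == 0)
```

Because each basket contributes at most one row, both apples of a pair can never end up in the same sample, and the final `stopifnot()` is the kind of duplicate check the question asks for.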