I have a data frame 'df' containing species names, ID, class, and size.
I am trying to group the data frame by Species and then keep three rows from each group. The rows should be chosen based on class: priority goes to the lowest number in class first, and then to "yes" over "no". For example, if a Species has 3 rows with "1_no", 1 row with "2_yes" and another with "3_no", I want to keep the three "1_no" rows. Likewise, if a Species has both a "3_yes" row and a "3_no" row, the "3_yes" row should be preferred.
However, if a Species, such as "Eutrigla gurnardus", has only "1_yes" in every row, I want to sample three rows of that Species at random.
Species            | ID | class | size
-----------------------------------------------------
Tilapia guineensis | 1  | 1_yes | 400
Tilapia guineensis | 1  | 1_no  | 300
Tilapia guineensis | 1  | 2_no  | 700
Tilapia guineensis | 1  | 3_yes | 900
Tilapia guineensis | 1  | 3_yes | 900
Tilapia zillii     | 2  | 2_yes | 600
Tilapia zillii     | 2  | 2_no  | 200
Tilapia zillii     | 2  | 1_yes | 500
Tilapia zillii     | 2  | 3_no  | 200
Tilapia zillii     | 2  | 2_yes | 500
Eutrigla gurnardus | 5  | 1_yes | 100
Eutrigla gurnardus | 5  | 1_yes | 200
Eutrigla gurnardus | 5  | 1_yes | 100
Eutrigla gurnardus | 5  | 1_yes | 200
Sprattus sprattus  | 6  | 4_no  | 300
Sprattus sprattus  | 6  | 3_yes | 400
Sprattus sprattus  | 6  | 4_yes | 300
Sprattus sprattus  | 6  | 5_yes | 400
My desired output would be something like this:
Species            | ID | class | size
-----------------------------------------------------
Tilapia guineensis | 1  | 1_yes | 400
Tilapia guineensis | 1  | 1_no  | 300
Tilapia guineensis | 1  | 2_no  | 700
Tilapia zillii     | 2  | 2_yes | 600
Tilapia zillii     | 2  | 1_yes | 500
Tilapia zillii     | 2  | 2_yes | 500
Eutrigla gurnardus | 5  | 1_yes | 100
Eutrigla gurnardus | 5  | 1_yes | 100
Eutrigla gurnardus | 5  | 1_yes | 200
Sprattus sprattus  | 6  | 4_no  | 300
Sprattus sprattus  | 6  | 3_yes | 400
Sprattus sprattus  | 6  | 4_yes | 300
CodePudding user response:
You can randomly sort the data, then arrange again by the two components of class to preferentially but randomly choose the top 3 rows within each Species. Rows that tie on both components keep their shuffled order, so ties are broken at random.
df <- structure(list(Species = c("Tilapia guineensis", "Tilapia guineensis",
"Tilapia guineensis", "Tilapia guineensis", "Tilapia guineensis",
"Tilapia zillii", "Tilapia zillii", "Tilapia zillii", "Tilapia zillii",
"Tilapia zillii", "Eutrigla gurnardus", "Eutrigla gurnardus",
"Eutrigla gurnardus", "Eutrigla gurnardus", "Sprattus sprattus",
"Sprattus sprattus", "Sprattus sprattus", "Sprattus sprattus"
), ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 5, 5, 5, 5, 6, 6, 6,
6), class = c("1_yes", "1_no", "2_no", "3_yes", "3_yes", "2_yes",
"2_no", "1_yes", "3_no", "2_yes", "1_yes", "1_yes", "1_yes",
"1_yes", "4_no", "3_yes", "4_yes", "5_yes"), size = c(400, 300,
700, 900, 900, 600, 200, 500, 200, 500, 100, 200, 100, 200, 300,
400, 300, 400)), class = c("data.frame"), row.names = c(NA,
-18L))
library(dplyr)
library(tidyr)
df %>%
  # split class into its two components
  separate(class, into = c("number", "yesno"),
           remove = FALSE, convert = TRUE) %>%
  group_by(Species) %>%
  # random order
  slice_sample(prop = 1) %>%
  # arrange by 1, 2, 3... yes, no on top of random order
  arrange(number, desc(yesno)) %>%
  # take the first 3
  slice_head(n = 3) %>%
  select(-number, -yesno)
#> # A tibble: 12 × 4
#> # Groups:   Species [4]
#>    Species               ID class  size
#>    <chr>              <dbl> <chr> <dbl>
#>  1 Eutrigla gurnardus     5 1_yes   200
#>  2 Eutrigla gurnardus     5 1_yes   100
#>  3 Eutrigla gurnardus     5 1_yes   200
#>  4 Sprattus sprattus      6 3_yes   400
#>  5 Sprattus sprattus      6 4_yes   300
#>  6 Sprattus sprattus      6 4_no    300
#>  7 Tilapia guineensis     1 1_yes   400
#>  8 Tilapia guineensis     1 1_no    300
#>  9 Tilapia guineensis     1 2_no    700
#> 10 Tilapia zillii         2 1_yes   500
#> 11 Tilapia zillii         2 2_yes   500
#> 12 Tilapia zillii         2 2_yes   600
Created on 2022-05-26 by the reprex package (v2.0.1)
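Since the tie-breaking is random, repeated runs can return different rows for a Species like "Eutrigla gurnardus". If you need the same rows every time, one option is to fix the random seed first. The sketch below is only an illustration under a few assumptions: the 'toy' data frame and the seed value 42 are made up for the example, and the `.by_group = TRUE` and `ungroup()` steps are optional additions that sort inside each Species and return an ungrouped result.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, not the data from the question
toy <- data.frame(
  Species = c("A", "A", "A", "A", "B", "B", "B", "B"),
  class   = c("2_no", "1_no", "1_yes", "3_yes",
              "1_yes", "1_yes", "1_yes", "1_yes"),
  size    = c(10, 20, 30, 40, 50, 60, 70, 80)
)

# Arbitrary seed (an assumption), only so that repeated runs pick the same rows
set.seed(42)

picked <- toy %>%
  # split class into its numeric part and its yes/no part
  separate(class, into = c("number", "yesno"),
           remove = FALSE, convert = TRUE) %>%
  group_by(Species) %>%
  # shuffle the rows within each Species
  slice_sample(prop = 1) %>%
  # lowest number first, "yes" before "no", sorted inside each Species
  arrange(number, desc(yesno), .by_group = TRUE) %>%
  # keep the first 3 rows of each Species
  slice_head(n = 3) %>%
  ungroup() %>%
  select(-number, -yesno)

picked
```

For Species "A" the three kept rows are fully determined by class, while for "B" every row ties on "1_yes", so which three are kept depends on the shuffle (and therefore on the seed).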