How to sample rows from a data frame that was previously grouped by a specific column, according to

Time:05-27

I have a data frame 'df' containing different species names, ID, class, and size.

I am trying to group the data frame by Species and then sample three rows from each group. The rows should be chosen by priority based on class: the lowest number in class comes first, and within the same number "yes" is preferred over "no". For example, if a Species has 3 rows with "1_no", 1 row with "2_yes", and another with "3_no", I want to keep the three "1_no" rows. Likewise, if the choice is between a "3_yes" row and a "3_no" row, the "3_yes" row should be kept in the data frame.

However, if a species such as "Eutrigla gurnardus" has only "1_yes" in every row, I want to sample three rows of that species at random.

Species            | ID | class | size
--------------------------------------
Tilapia guineensis |  1 | 1_yes |  400
Tilapia guineensis |  1 | 1_no  |  300
Tilapia guineensis |  1 | 2_no  |  700
Tilapia guineensis |  1 | 3_yes |  900
Tilapia guineensis |  1 | 3_yes |  900
Tilapia zillii     |  2 | 2_yes |  600
Tilapia zillii     |  2 | 2_no  |  200
Tilapia zillii     |  2 | 1_yes |  500
Tilapia zillii     |  2 | 3_no  |  200
Tilapia zillii     |  2 | 2_yes |  500
Eutrigla gurnardus |  5 | 1_yes |  100
Eutrigla gurnardus |  5 | 1_yes |  200
Eutrigla gurnardus |  5 | 1_yes |  100
Eutrigla gurnardus |  5 | 1_yes |  200
Sprattus sprattus  |  6 | 4_no  |  300
Sprattus sprattus  |  6 | 3_yes |  400
Sprattus sprattus  |  6 | 4_yes |  300
Sprattus sprattus  |  6 | 5_yes |  400

My desired output would be something like this:

Species            | ID | class | size
--------------------------------------
Tilapia guineensis |  1 | 1_yes |  400
Tilapia guineensis |  1 | 1_no  |  300
Tilapia guineensis |  1 | 2_no  |  700
Tilapia zillii     |  2 | 2_yes |  600
Tilapia zillii     |  2 | 1_yes |  500
Tilapia zillii     |  2 | 2_yes |  500
Eutrigla gurnardus |  5 | 1_yes |  100
Eutrigla gurnardus |  5 | 1_yes |  100
Eutrigla gurnardus |  5 | 1_yes |  200
Sprattus sprattus  |  6 | 4_no  |  300
Sprattus sprattus  |  6 | 3_yes |  400
Sprattus sprattus  |  6 | 4_yes |  300

Answer:

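You can shuffle the rows within each Species, then sort by the two components of class (number ascending, "yes" before "no") and take the first 3 rows of each Species. Because rows that tie on class keep their shuffled order, the choice among equally ranked rows is random, while better-ranked rows are always preferred.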

df <- structure(
  list(
    Species = c("Tilapia guineensis", "Tilapia guineensis", "Tilapia guineensis",
                "Tilapia guineensis", "Tilapia guineensis",
                "Tilapia zillii", "Tilapia zillii", "Tilapia zillii",
                "Tilapia zillii", "Tilapia zillii",
                "Eutrigla gurnardus", "Eutrigla gurnardus",
                "Eutrigla gurnardus", "Eutrigla gurnardus",
                "Sprattus sprattus", "Sprattus sprattus",
                "Sprattus sprattus", "Sprattus sprattus"),
    ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 5, 5, 5, 5, 6, 6, 6, 6),
    class = c("1_yes", "1_no", "2_no", "3_yes", "3_yes",
              "2_yes", "2_no", "1_yes", "3_no", "2_yes",
              "1_yes", "1_yes", "1_yes", "1_yes",
              "4_no", "3_yes", "4_yes", "5_yes"),
    size = c(400, 300, 700, 900, 900, 600, 200, 500, 200, 500,
             100, 200, 100, 200, 300, 400, 300, 400)
  ),
  class = c("data.frame"),
  row.names = c(NA, -18L)
)

library(dplyr)
library(tidyr)

df %>% 
  # split class into its two components
  separate(class, into = c("number", "yesno"), 
           remove = FALSE, convert = TRUE) %>% 
  group_by(Species) %>% 
  # shuffle the rows within each Species
  slice_sample(prop = 1) %>% 
  # sort by number, then "yes" before "no"; tied rows keep their shuffled order
  arrange(number, desc(yesno)) %>% 
  # keep the first 3 rows of each Species
  slice_head(n = 3) %>% 
  select(-number, -yesno)

#> # A tibble: 12 × 4
#> # Groups:   Species [4]
#>    Species               ID class  size
#>    <chr>              <dbl> <chr> <dbl>
#>  1 Eutrigla gurnardus     5 1_yes   200
#>  2 Eutrigla gurnardus     5 1_yes   100
#>  3 Eutrigla gurnardus     5 1_yes   200
#>  4 Sprattus sprattus      6 3_yes   400
#>  5 Sprattus sprattus      6 4_yes   300
#>  6 Sprattus sprattus      6 4_no    300
#>  7 Tilapia guineensis     1 1_yes   400
#>  8 Tilapia guineensis     1 1_no    300
#>  9 Tilapia guineensis     1 2_no    700
#> 10 Tilapia zillii         2 1_yes   500
#> 11 Tilapia zillii         2 2_yes   500
#> 12 Tilapia zillii         2 2_yes   600

Created on 2022-05-26 by the reprex package (v2.0.1)
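
For comparison, the same idea can be sketched in base R without any packages. This is only an illustrative alternative, not part of the answer above: the helper name `sample_by_priority`, its argument `k`, and the explicit random tiebreaker column are assumptions made for this sketch. Instead of shuffling first and relying on the sort keeping ties in place, it adds a random number as the last sort key, which should give the same kind of result.

```r
# Hypothetical helper (not from the original answer): keep the top k rows per
# Species, ranked by class number, then "yes" before "no", then at random.
sample_by_priority <- function(data, k = 3) {
  # Split "2_yes" into its number ("2") and its yes/no part ("yes")
  parts  <- strsplit(as.character(data$class), "_", fixed = TRUE)
  number <- as.integer(vapply(parts, `[`, character(1), 1))
  is_yes <- vapply(parts, `[`, character(1), 2) == "yes"

  # Random tiebreaker so equally ranked rows are picked at random
  tiebreak <- runif(nrow(data))

  # FALSE sorts before TRUE, so !is_yes puts "yes" rows first
  ord    <- order(data$Species, number, !is_yes, tiebreak)
  sorted <- data[ord, , drop = FALSE]

  # Position of each row within its Species after sorting
  rank_in_group <- ave(seq_len(nrow(sorted)), sorted$Species, FUN = seq_along)
  sorted[rank_in_group <= k, , drop = FALSE]
}

# Small demo using two of the species from the question
demo <- data.frame(
  Species = c(rep("Sprattus sprattus", 4), rep("Eutrigla gurnardus", 4)),
  class   = c("4_no", "3_yes", "4_yes", "5_yes",
              "1_yes", "1_yes", "1_yes", "1_yes"),
  size    = c(300, 400, 300, 400, 100, 200, 100, 200)
)

set.seed(1)  # assumption: any seed works; it only makes the draw repeatable
result <- sample_by_priority(demo)
print(result)
```

For "Sprattus sprattus" the kept rows are always 3_yes, 4_yes and 4_no, since the ranking fully decides them; for "Eutrigla gurnardus" three of the four 1_yes rows are kept, and which three depends on the random draw. Calling `set.seed()` beforehand makes that draw repeatable, and the same applies to `slice_sample()` in the dplyr version.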
