R function for randomly removing one data point at a time from each category?

Time:10-15

I am trying to analyze how the estimated percent cover of a reef changes as the number of points used in the analysis changes. My actual dataset consists of 150 photos, each with 50 points. The idea is to have R estimate percent cover with all the points, then remove one point from each photo and reanalyze, then remove another point and reanalyze, and so on.

Any help with writing or finding a function like this, or pointers on where to look, is welcome, as I am very new to all this! Below is a sample dataset with just 3 plots and 5 points per plot. As mentioned, the idea is to analyze with all points, then randomly remove one point from each plot, reanalyze, and repeat. With this sample, the first analysis would use 15 points, the next a total of 12 points, and so on.

Sample dataset:

Plot ID
1    S
1    S
1    S
1    T
1    T
2    S
2    C
2    C
2    SP
2    S
3    S
3    T
3    T
3    C
3    T

Thank you!

CodePudding user response:

base R

# dat is the sample data (the same data frame as df1 in the Data section below)
set.seed(42)
# keep a row if its Plot has only one row left, or if it is not the sampled row
dat[ave(rep(TRUE, nrow(dat)), dat$Plot, 
        FUN = function(z) length(z) <= 1 | !seq_along(z) %in% sample(length(z), 1)),]
#    Plot ID
# 2     1  S
# 3     1  S
# 4     1  T
# 5     1  T
# 6     2  S
# 7     2  C
# 8     2  C
# 9     2 SP
# 12    3  T
# 13    3  T
# 14    3  C
# 15    3  T

I added the logic to ensure a minimum size of 1 (length(z) <= 1 keeps a Plot's last remaining row). You might want to raise that threshold if you have different needs, or remove the condition if you don't mind a Plot disappearing once it is down to one row.
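To make the minimum-size idea easier to adjust, the guard can be wrapped in a small helper. This is only a sketch: drop_one_per_group and its min_size argument are hypothetical names introduced here, not part of the answer above, and the toy data frame exists purely to show a Plot that is already down to one row.

```r
# Hypothetical helper (not from the original answer): drop one random row per
# group, but leave any group with `min_size` rows or fewer untouched.
drop_one_per_group <- function(data, group, min_size = 1) {
  keep <- ave(rep(TRUE, nrow(data)), data[[group]], FUN = function(z) {
    if (length(z) <= min_size) return(z)   # group is already at the floor
    z[sample.int(length(z), 1)] <- FALSE   # mark one random row for removal
    z
  })
  data[keep, , drop = FALSE]
}

# Toy data (an assumption for illustration): Plot 1 has three rows, Plot 2 has one.
toy <- data.frame(Plot = c(1L, 1L, 1L, 2L), ID = c("S", "T", "S", "C"))

set.seed(1)
res <- drop_one_per_group(toy, "Plot")
table(res$Plot)   # Plot 1 is down to 2 rows; Plot 2 keeps its single row
```

Setting min_size = 0 would let a Plot disappear entirely once its last row is drawn.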

dplyr

library(dplyr)
set.seed(42)
dat %>%
  group_by(Plot) %>%
  sample_n(n() - 1) %>%
  ungroup()
# # A tibble: 12 x 2
#     Plot ID   
#    <int> <chr>
#  1     1 S    
#  2     1 T    
#  3     1 T    
#  4     1 S    
#  5     2 C    
#  6     2 SP   
#  7     2 S    
#  8     2 C    
#  9     3 S    
# 10     3 C    
# 11     3 T    
# 12     3 T    

CodePudding user response:

Here is a base R function with tapply/sample.
Its arguments are the data.frame and the grouping column.

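sample_rows <- function(data, group){
  group <- as.character(substitute(group))
  # pick one row index per group; x[sample.int(...)] avoids sample()'s special
  # case when a group has a single row left (sample(7, 1) draws from 1:7)
  tapply(seq_len(nrow(data)), data[[group]], \(x) x[sample.int(length(x), 1)])
}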

set.seed(2021)

i <- sample_rows(df1, Plot)
df2 <- df1[-i, ]
nrow(df2)
#[1] 12

i <- sample_rows(df2, Plot)
df2 <- df2[-i, ]
nrow(df2)
#[1] 9

Data

df1 <-
structure(list(Plot = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L), ID = c("S", "S", "S", "T", "T", "S", "C", 
"C", "SP", "S", "S", "T", "T", "C", "T")), class = "data.frame", 
row.names = c(NA, -15L))
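
Putting the pieces together, the remove-and-reanalyze loop from the question might look like the sketch below. It rests on assumptions that are not in the question or the answers: percent cover is taken as each ID's share of all remaining points pooled across plots, every plot starts with the same number of points, and the names percent_cover, pick_one_per_group and cover_by_step are made up for illustration.

```r
# Sample data from the question (same content as df1 above).
df1 <- data.frame(
  Plot = rep(1:3, each = 5),
  ID   = c("S", "S", "S", "T", "T", "S", "C", "C", "SP", "S",
           "S", "T", "T", "C", "T")
)

# Assumed definition: percent cover = share of all remaining points per ID.
percent_cover <- function(data, levels) {
  counts <- table(factor(data$ID, levels = levels))
  setNames(100 * as.vector(counts) / sum(counts), names(counts))
}

# One random row position per group, returned as a plain integer vector.
pick_one_per_group <- function(data, group) {
  as.vector(tapply(seq_len(nrow(data)), data[[group]],
                   function(x) x[sample.int(length(x), 1)]))
}

set.seed(2021)
lv  <- sort(unique(df1$ID))   # fixed category set so every step has the same columns
cur <- df1
results <- list()
repeat {
  pts <- nrow(cur) / length(unique(cur$Plot))   # points per plot at this step
  results[[length(results) + 1]] <-
    c(points_per_plot = pts, percent_cover(cur, lv))
  if (pts <= 1) break                           # stop at one point per plot
  cur <- cur[-pick_one_per_group(cur, "Plot"), , drop = FALSE]
}

# One row per step: points per plot, then percent cover for each ID.
cover_by_step <- do.call(rbind, results)
cover_by_step
```

Because the removals are random, a single run shows only one possible trajectory; repeating the loop with different seeds and averaging the rows would give a steadier picture of how the estimate changes with the number of points.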