Generate sample data according to a supposed proportion

Time:07-06

I am working on a project where products coming off the production line are defective only in very rare cases. For example, 1 in 1,000,000 products has a defect.

How can I generate data in R, Python, or Excel that represents samples from this distribution?

CodePudding user response:

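In R you could do:

sample(c(1, rep(0, 1e6 - 1)), size = 10)

This draws without replacement from a population of one defective item (1) and 999,999 good items (0).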

You can adjust the size argument as needed. With size = 10 you'll get 10 samples, most likely all zeros:

[1] 0 0 0 0 0 0 0 0 0 0

With a defect probability of 1/1e6, it will take a long time before you see a 1.
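Since the question also mentions Python, here is a minimal sketch of the same idea using only the standard library. It assumes each product is defective independently with probability p, which is a slightly different model from the fixed population above. The names sample_defects and prob_at_least_one are made up for illustration.

```python
import random


def sample_defects(n, p=1e-6, seed=None):
    """Return n independent 0/1 draws; 1 (defect) occurs with probability p."""
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so the comparison is true with probability p.
    return [1 if rng.random() < p else 0 for _ in range(n)]


def prob_at_least_one(n, p=1e-6):
    """Probability that n independent draws contain at least one defect."""
    return 1 - (1 - p) ** n


samples = sample_defects(10, seed=1)
print(samples)

# Chance of seeing at least one defect among one million products.
print(prob_at_least_one(1_000_000))
```

Under this assumption you need about 1/p = 1,000,000 draws on average before the first defect, and even a million draws contain at least one defect only about 63% of the time.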

CodePudding user response:

If you use the tidyverse in R you can do this:

library(tidyverse)
set.seed(1)
tibble(
  group = c(1, 1, 2, 2, 2),
  id = seq_len(5)
) %>% 
  group_by(group) %>% 
  add_count(name = "group_size") %>%
  slice_sample(n = min(.$group_size)) %>% 
  select(-group_size)
#> # A tibble: 4 × 2
#> # Groups:   group [2]
#>   group    id
#>   <dbl> <int>
#> 1     1     1
#> 2     1     2
#> 3     2     5
#> 4     2     3

Created on 2022-07-05 by the reprex package (v2.0.1)

This uses the size of the smallest group as n and draws a sample of that size from each group. Instead of min(.$group_size) you can also set n to a fixed number if you prefer.
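For anyone working in Python instead, the same downsampling idea could be sketched roughly like this with the standard library. It is not a translation of the dplyr internals, just one way to do it. The function name balance_groups and the list-of-dicts layout are assumptions made for the example.

```python
import random
from collections import defaultdict


def balance_groups(rows, key, seed=None):
    """Downsample every group to the size of the smallest group."""
    rng = random.Random(seed)

    # Bucket the rows by their group label.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    if not groups:
        return []

    # The smallest group sets the sample size for every group.
    n = min(len(members) for members in groups.values())

    balanced = []
    for members in groups.values():
        # rng.sample draws n distinct rows without replacement.
        balanced.extend(rng.sample(members, n))
    return balanced


# Same toy data as the tibble above: group 1 has 2 rows, group 2 has 3.
rows = [{"group": g, "id": i} for i, g in enumerate([1, 1, 2, 2, 2], start=1)]
balanced = balance_groups(rows, "group", seed=1)
print(balanced)
```

The result has two rows from each group. Which rows of the larger group are kept depends on the seed.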
