I have a dataset that looks like this:
Region | Name |
---|---|
Region 1 | Name 14 |
Region 2 | Name 18 |
Region 2 | Name 2 |
Region 2 | Name 21 |
Region 2 | Name 44 |
Region 3 | Name 64 |
Region 3 | Name 24 |
Region 4 | Name 1 |
Region 4 | Name 1 |
Region 4 | Name 98 |
Region 5 | Name 98 |
Region 5 | Name 8 |
Region 5 | Name 8 |
Region 5 | Name 8 |
Region 5 | Name 98 |
I need to break up the data by Region and then select a random sample of only 5% of the "Name" values per Region, based on the number of rows in each Region.
So let's say there are 30 Names in Region 2; then I need a random sample of 30 * 0.05 rows. If there are 50 Names in Region 6, then I need a random sample of 50 * 0.05 rows.
So far, I've been able to split() the data using
d = split(data, f = data$Region)
but when I try to run an lapply function, I get an error saying that there are different numbers of rows in the list that split() provided:
lapply(data, function(x) {
sample_n(data, nrow(d)*.05)
} )
Any thoughts?
Thank you
CodePudding user response:
Here's a base R solution.
# For each region's data frame x, sample 5% of its rows
lapply(split(data, data$Region),
       \(x) x[sample(nrow(x), nrow(x) * 0.05), ])
You can then combine the resulting list back into a single data frame with do.call(rbind, ...).
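For reference, here is a minimal end-to-end sketch of that approach. The data frame is made up for illustration (the region sizes and names are not from the question), and rounding the sample size with round() is an assumption, since the question does not say how fractional sizes such as 1.5 should be handled. Note that the attempt in the question iterates over `data` instead of the split list `d` and never uses the function argument `x`, which is why it fails.

```r
# Hypothetical data: 5 regions with 20, 40, 60, 80 and 100 rows
set.seed(42)  # only so the example is reproducible
sizes <- c(20, 40, 60, 80, 100)
data <- data.frame(
  Region = rep(paste("Region", 1:5), times = sizes),
  Name   = paste("Name", sample(100, sum(sizes), replace = TRUE))
)

# Split by region, then sample 5% of the rows within each piece.
# Assumption: fractional sizes are rounded to the nearest whole number.
sampled_list <- lapply(split(data, data$Region), function(x) {
  n <- round(nrow(x) * 0.05)
  x[sample(nrow(x), n), , drop = FALSE]
})

# Stack the per-region samples back into one data frame
sampled <- do.call(rbind, sampled_list)

# Number of sampled rows per region
table(sampled$Region)
```

With very small regions (fewer than 10 rows here), the rounded size can be 0, so those regions would drop out of the result. If every region should contribute at least one row, wrapping the size in max(1, ...) is one option.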