I have a list of 13 data frames. The first few rows of the first data frame look like this:
> Visit.data_allyears[[1]]
SiteName year PAdata Longitude Latitude totalspp totalhours lhours temperature rainfall NDVI
1 2229AB 2007 0 29.375 -22.125 0.27388999 0.04145321 0.359057436 0.7571729 0.34862768 0.25624133
2 2230CA 2007 0 30.125 -22.625 -0.46728113 -0.43741429 -0.460164072 0.8803066 -0.76683748 -0.15804871
3 2230DA 2007 0 30.625 -22.625 -0.79669052 0.28088696 0.670510998 1.0815264 -0.86448501 -0.68218838
4 2230DB 2007 0 30.875 -22.625 -1.99079956 -0.43741429 -0.460164072 1.3638363 -0.92470284 -0.86108636
5 2231AC 2007 0 31.125 -22.375 2.82681276 -0.43741429 -0.460164072 0.8652892 1.39814838 1.64976237 NA
SiteName values may be repeated many times. For each data frame, if a SiteName occurs more than 50 times, I want to randomly sample 50 of that site's rows and remove the rest. Everything else should stay as it is: a site that occurs 50 times or fewer keeps all of its rows.
How would one go about this?
CodePudding user response:
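You may use slice_sample from dplyr. When a group has fewer rows than n, slice_sample silently returns the whole group, so sites with 50 or fewer rows are kept in full.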
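library(dplyr)
# Keep at most 50 randomly chosen rows per SiteName in each data frame.
lapply(Visit.data_allyears, function(x) {
  x %>%
    group_by(SiteName) %>%
    slice_sample(n = 50) %>%
    ungroup()  # drop the grouping so later steps see an ungrouped tibble
})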
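If you would rather avoid the dplyr dependency, or need the surviving rows to stay in their original order, a base R version is possible. The following is only a minimal sketch: `cap_site_rows` is a hypothetical helper name, and the toy data is made up for illustration (it is not the asker's data). Note that `split()` drops rows whose SiteName is NA, so this assumes SiteName has no missing values.

```r
# Hypothetical helper: keep at most `cap` rows per SiteName,
# preserving the original row order of the data frame.
cap_site_rows <- function(df, cap = 50) {
  # Row indices grouped by site (assumes SiteName has no NA values).
  idx_by_site <- split(seq_len(nrow(df)), df$SiteName)
  keep <- unlist(lapply(idx_by_site, function(idx) {
    # Only sites with more than `cap` rows are subsampled.
    if (length(idx) > cap) idx[sample.int(length(idx), cap)] else idx
  }), use.names = FALSE)
  df[sort(keep), , drop = FALSE]
}

# Made-up toy data: site "A" has 60 rows, site "B" has 3.
set.seed(1)
toy <- data.frame(SiteName = c(rep("A", 60), rep("B", 3)),
                  year = 2007,
                  value = rnorm(63))
toy_list <- list(toy, toy)

# Apply the helper to every data frame in the list.
capped <- lapply(toy_list, cap_site_rows, cap = 50)
table(capped[[1]]$SiteName)
```

With the real data the call would be `lapply(Visit.data_allyears, cap_site_rows, cap = 50)`. Calling `set.seed()` beforehand makes the random selection reproducible.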