Randomly sample rows if more than 50 (R)

I have a list of 13 data frames. The first few rows of the first data frame look like this:

> Visit.data_allyears[[1]]
   SiteName year PAdata Longitude Latitude    totalspp  totalhours       lhours temperature    rainfall        NDVI
1    2229AB 2007      0    29.375  -22.125  0.27388999  0.04145321  0.359057436   0.7571729  0.34862768  0.25624133
2    2230CA 2007      0    30.125  -22.625 -0.46728113 -0.43741429 -0.460164072   0.8803066 -0.76683748 -0.15804871
3    2230DA 2007      0    30.625  -22.625 -0.79669052  0.28088696  0.670510998   1.0815264 -0.86448501 -0.68218838
4    2230DB 2007      0    30.875  -22.625 -1.99079956 -0.43741429 -0.460164072   1.3638363 -0.92470284 -0.86108636
5    2231AC 2007      0    31.125  -22.375  2.82681276 -0.43741429 -0.460164072   0.8652892  1.39814838  1.64976237  NA

SiteName values may be repeated multiple times. For each data frame, if a SiteName appears more than 50 times, I want to randomly sample 50 of that site's rows and remove the rest. Everything else should remain as is: if a site appears 50 times or fewer, all of its rows should be kept.

How would one go about this?

Answer:

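You can use slice_sample() from dplyr on data grouped by SiteName. When a group has fewer rows than n, slice_sample() silently truncates n to the group size, so sites with 50 or fewer rows keep all of their rows.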

library(dplyr)

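# Apply the same capping step to every data frame in the list
lapply(Visit.data_allyears, function(x) {
  x %>%
    group_by(SiteName) %>%
    slice_sample(n = 50) %>%  # at most 50 random rows per site
    ungroup()                 # drop the grouping from the result
})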
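
To check this behaviour on something small, here is a minimal sketch using made-up data. The `toy` data frame, its `value` column, and the site labels "A" and "B" are hypothetical and are not part of the original data. It assumes dplyr 1.0.0 or later, where slice_sample() is available.

```r
library(dplyr)

set.seed(42)  # make the random sample reproducible

# Hypothetical toy data: site "A" appears 120 times, site "B" only 10 times
toy <- data.frame(
  SiteName = rep(c("A", "B"), times = c(120, 10)),
  value    = seq_len(130)
)

capped <- toy %>%
  group_by(SiteName) %>%
  slice_sample(n = 50) %>%  # n is truncated to the group size for small groups
  ungroup()

# Rows per site after sampling: "A" is capped at 50, "B" keeps all 10 rows
table(capped$SiteName)
```

Note that slice_sample() draws without replacement by default, so no row is duplicated. Rows within each site may come back in a different order than in the input.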