I have a dataset with over 400,000 cows. These cows are (unevenly) spreak over 2355 herds. Some herds are only present once in the data, while one herd is even present 2033 times in the data, meaning that 2033 cows belong to this herd. I want to delete herds from my data that occur less than 200 times.
With use of plyr
and subset
, I can obtain a list of which herds occur less than 200 times, I however can not find out how to apply this selection to the full dataset.
For example, my current data looks a little like:
cow herd
1 1
2 1
3 1
4 2
5 3
6 4
7 4
8 4
With function count()
I can obtain the following:
x freq
1 3
2 1
3 1
4 3
Say I want to delete the data belonging to herds that occur less than 3 times, I want my data to look like this eventually:
cow herd
1 1
2 1
3 1
6 4
7 4
8 4
I do know how to tell R to delete data herd by herd, however since, in my real datatset, over 1000 herds occur less then 200 times, it would mean that I would have to type every herd number in my script one by one. I am sure there is an easier and quicker way of asking R to delete data above or below a certain occurence.
I hope my explanation is clear and someone can help me, thanks in advance!
CodePudding user response:
Use n
group_by
:
library(dplyr)
your_data %>%
group_by(herd) %>%
filter(n() >= 3)