How to filter rows based on number of repeats of variable combinations

I have a dataset like this:

library(tibble)

data <- tibble(year = c(2010, 2010, 2012, 2010, 2011, 2011, 2013, 2013, 2010, 2011, 2012, 2013),
               state = c("ca", "ca", "ca", "ny", "ny", "ny", "ny", "ny", "wa", "wa", "wa", "wa"),
               variable2 = c("a", "b", "c", "b", "c", "a", "d", "a", "b", "b", "c", "b"),
               value = c(6, 5, 2, 6, 3, 1, 7, 8, 3, 2, 5, 7))

I would like to select only the data for states with at least 3 unique years. In this data, that would be ny and wa. I would like to retain all the rows for those states. Because of variable2, some states have multiple data points for the same year, but I'm only interested in states with at least 3 unique years, regardless of the value of variable2. Thanks.

CodePudding user response：

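You may try the following, which returns the names of the states with at least 3 unique years:

library(dplyr)

data %>%
    group_by(state) %>%
    summarise(n = n_distinct(year)) %>%  # number of distinct years per state
    filter(n >= 3) %>%
    pull(state)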
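
Since the question asks to retain all rows for the qualifying states rather than just their names, a grouped filter may be closer to what is needed. The following is a minimal sketch, assuming dplyr is installed; the name `result` is only illustrative, and the example data is rebuilt as a plain data.frame so the snippet runs on its own.

```r
library(dplyr)

# Same example data as in the question, as a plain data.frame.
data <- data.frame(
  year = c(2010, 2010, 2012, 2010, 2011, 2011, 2013, 2013, 2010, 2011, 2012, 2013),
  state = c("ca", "ca", "ca", "ny", "ny", "ny", "ny", "ny", "wa", "wa", "wa", "wa"),
  variable2 = c("a", "b", "c", "b", "c", "a", "d", "a", "b", "b", "c", "b"),
  value = c(6, 5, 2, 6, 3, 1, 7, 8, 3, 2, 5, 7)
)

# Within each state, keep the rows only if that state has >= 3 distinct years.
result <- data %>%
  group_by(state) %>%
  filter(n_distinct(year) >= 3) %>%
  ungroup()

result
```

Here `filter()` is evaluated per group, so `n_distinct(year)` yields one value per state and either all or none of that state's rows are kept.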

CodePudding user response：

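You could define a helper function ulen that counts unique values and use it in ave(), which returns the per-state count repeated for every row of that state. Note that the \(x) shorthand for function(x) requires R 4.1 or later.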

ulen <- \(x) length(unique(x))

data[with(data, ave(year, state, FUN=ulen)) > 2, ]
#    year state variable2 value
# 4  2010    ny         b     6
# 5  2011    ny         c     3
# 6  2011    ny         a     1
# 7  2013    ny         d     7
# 8  2013    ny         a     8
# 9  2010    wa         b     3
# 10 2011    wa         b     2
# 11 2012    wa         c     5
# 12 2013    wa         b     7

Data:

data <- structure(list(year = c(2010, 2010, 2012, 2010, 2011, 2011, 2013, 
2013, 2010, 2011, 2012, 2013), state = c("ca", "ca", "ca", "ny", 
"ny", "ny", "ny", "ny", "wa", "wa", "wa", "wa"), variable2 = c("a", 
"b", "c", "b", "c", "a", "d", "a", "b", "b", "c", "b"), value = c(6, 
5, 2, 6, 3, 1, 7, 8, 3, 2, 5, 7)), class = "data.frame", row.names = c(NA, 
-12L))
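
To see why the ave() approach works, it can help to look at the intermediate vector before it is used for subsetting. This is a small sketch using only base R; the name `n_years` is introduced here purely for illustration, and `function(x)` is used instead of `\(x)` so it also runs on older R versions.

```r
data <- data.frame(
  year = c(2010, 2010, 2012, 2010, 2011, 2011, 2013, 2013, 2010, 2011, 2012, 2013),
  state = c("ca", "ca", "ca", "ny", "ny", "ny", "ny", "ny", "wa", "wa", "wa", "wa"),
  variable2 = c("a", "b", "c", "b", "c", "a", "d", "a", "b", "b", "c", "b"),
  value = c(6, 5, 2, 6, 3, 1, 7, 8, 3, 2, 5, 7)
)

ulen <- function(x) length(unique(x))

# One entry per row: the number of unique years in that row's state.
n_years <- ave(data$year, data$state, FUN = ulen)
n_years
# [1] 2 2 2 3 3 3 3 3 4 4 4 4

# The logical comparison then selects whole states at once.
kept <- data[n_years >= 3, ]
kept
```

Because ave() returns a vector aligned with the original rows, no merge or join is needed to apply the per-state count back to the data.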

CodePudding user response：

Try this. The loop removes all rows for any state that has fewer than three unique years.

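states <- levels(factor(data$state))

for (s in states) {
  # Rows belonging to the current state.
  data_group <- data[data$state == s, ]
  length_year <- length(unique(data_group$year))

  # Drop the state entirely if it has fewer than 3 unique years.
  if (length_year < 3) {
    data <- data[data$state != s, ]
  }
}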
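
The same idea can be written without a loop and without overwriting data step by step. The following is one possible base R sketch, not the answerer's code; the names `counts`, `keep`, and `filtered` are assumptions made for the example.

```r
data <- data.frame(
  year = c(2010, 2010, 2012, 2010, 2011, 2011, 2013, 2013, 2010, 2011, 2012, 2013),
  state = c("ca", "ca", "ca", "ny", "ny", "ny", "ny", "ny", "wa", "wa", "wa", "wa"),
  variable2 = c("a", "b", "c", "b", "c", "a", "d", "a", "b", "b", "c", "b"),
  value = c(6, 5, 2, 6, 3, 1, 7, 8, 3, 2, 5, 7)
)

# Number of unique years per state, as a named vector-like array.
counts <- tapply(data$year, data$state, function(x) length(unique(x)))

# Names of the states that meet the threshold.
keep <- names(counts)[counts >= 3]

# Keep every row whose state is in that set; the original data is untouched.
filtered <- data[data$state %in% keep, ]
filtered
```

Computing the counts once and subsetting once leaves the original data frame intact, which can make the result easier to check against the input.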