R: filter by multiple OR conditions

I need to filter a dataframe by multiple "OR" conditions. Let me explain.

I have a dataframe (total) with 1 million observations. One of the columns (id) contains id numbers ranging from 1 to 6000. This means that many of the rows have duplicate id numbers.

I previously drew a random sample of 500 unique id numbers.

random.id <- sample(abc, 500, replace=F)

I want to keep the rows of my original dataset where the id column matches any of the values in random.id. In other words, I want to filter with many "OR" conditions. But since there are 500 conditions, I can't type them all out.

I've tried using the %in% operator.

filtered <- total %>%
  filter(id %in% random.id)

If the command worked as intended, then the new filtered dataframe should contain 500 unique id values.

length(unique(filtered$id))

Unfortunately, this number is well under 500. Even when I redo the random sample for random.id, the number of unique ids in the new dataframe is always under 500.

What should I do?

CodePudding user response：

Since you're using dplyr, here's a version of @Jon Spring's answer in dplyr syntax.
It does look like your issue is related to the contents of abc.

library(dplyr)

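# 500 distinct ids: 1:1000 has no duplicates to begin with
random_id <- sample(1:1000, 500, replace = FALSE)
# one million rows with ids drawn from 1 to 6000, repeats allowed
total <- tibble(id = sample(1:6000, 1e6, replace = TRUE))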

filtered <- total %>% filter(id %in% random_id)

n_distinct(filtered$id) # 500

Note: dplyr::n_distinct saves having to make two calls to length and unique.
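If the problem really is the contents of abc, a quick check on the sample itself should show it. Here is a minimal base R sketch, assuming abc is simply the id column of total (the question never says where abc comes from, so that is a guess). The seed and the name random.id.fixed are only for illustration.

```r
# Assumption: `abc` is the id column itself (total$id), so it contains repeats.
set.seed(1)  # arbitrary seed, only for reproducibility
total <- data.frame(id = sample(1:6000, 1e6, replace = TRUE))
abc <- total$id

# Sampling the raw column can pick the same id more than once.
random.id <- sample(abc, 500, replace = FALSE)
n_dupes <- sum(duplicated(random.id))  # how many draws repeat an earlier id

# Sampling the distinct ids guarantees 500 different values.
random.id.fixed <- sample(unique(abc), 500, replace = FALSE)
filtered <- total[total$id %in% random.id.fixed, , drop = FALSE]

length(unique(random.id)) + n_dupes  # 500: distinct ids plus repeated draws
length(unique(filtered$id))          # 500
```

If n_dupes is greater than zero for your own random.id, then %in% is doing its job and the shortfall comes from the sample, not from the filter.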

CodePudding user response：

You didn't mention where abc came from, but if it has duplicates then you might not have actually drawn 500 unique id numbers.

When you take a sample from a vector with duplicates, some of the samples may be dupes themselves, even if you don't replace, since you may be sampling different instances of the same id.

We can get non-unique values from a sample without replacement if the source distribution itself has duplicate values:

set.seed(0)
sample(c(1,1,2), size = 3, replace = FALSE)
[1] 1 1 2

Or using something like your example:

set.seed(0)
abc = sample(1:6000, size = 1E6, replace = TRUE)

length(unique(sample(abc, 500, replace=F)))
[1] 477

length(unique(sample(unique(abc), 500, replace=F)))
[1] 500
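The 477 above is about what a rough calculation predicts. As a sketch, assume each of the 6000 ids appears about equally often in abc, so that drawing 500 of the one million entries without replacement behaves almost like drawing 500 ids with replacement. This is an approximation, not an exact result, and the variable names are just for illustration.

```r
n_ids  <- 6000  # distinct id values, from the question
n_draw <- 500   # size of the sample

# Chance that one particular id is never drawn, treating draws as independent
p_missed <- (1 - 1 / n_ids)^n_draw

# Expected number of distinct ids among the 500 draws
expected_distinct <- n_ids * (1 - p_missed)
round(expected_distinct)  # about 480
```

So a sample drawn from the raw column will typically come up around 20 ids short of 500, which matches what the question describes.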