Filter by specific category in R

Time:10-29

I have a specific filtering question. Here is what my sample dataset looks like:

df <- data.frame(id = c(1,2,3,3,4,5),
                 cat= c("A","A","A","B","B","B"))
> df
  id cat
1  1     A
2  2     A
3  3     A
4  3     B
5  4     B
6  5     B

Grouping by id, when an id has more than one category in cat, I want to keep only the row where cat is "A". Ids with a single category should be kept as they are. So the desired output would be:

> df.1
  id cat
1  1   A
2  2   A
3  3   A
4  4   B
5  5   B

Any ideas?

Thanks!

CodePudding user response:

In this example you can take the first item from each group, because "A" always comes first within an id. In other situations you may need to sort the rows with arrange() first.

(using dplyr)

df %>% group_by(id) %>% summarise(cat = first(cat))
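
A minimal sketch of the "sort first" case mentioned above, assuming a hypothetical data frame df_unsorted (not from the question) in which "B" appears before "A" for id 3. Sorting on the logical cat != "A" places the "A" rows first within each id, since FALSE sorts before TRUE, so first() then picks "A" whenever it is present.

```r
library(dplyr)

# Hypothetical data: for id 3, "B" is listed before "A"
df_unsorted <- data.frame(id  = c(1, 2, 3, 3, 4, 5),
                          cat = c("A", "A", "B", "A", "B", "B"))

res_first <- df_unsorted %>%
  arrange(id, cat != "A") %>%      # within each id, rows with cat "A" come first
  group_by(id) %>%
  summarise(cat = first(cat))      # one row per id, taking the first cat

res_first
```

Note that summarise() returns exactly one row per id, so this only fits when a single row per id is wanted.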

CodePudding user response:

Base R:

aggregate(
  df$cat, 
  by = list(id = df$id), 
  FUN = \(x) {
    unx <- unique(x)
    if (length(unx) > 1) 'A' else unx
  }
)
#   id x
# 1  1 A
# 2  2 A
# 3  3 A
# 4  4 B
# 5  5 B

CodePudding user response:

If there are only two groups in cat, we can use the following logic:

df %>%
  group_by(id) %>%
  filter(! (n() == 2 & cat == "B"))

# A tibble: 5 x 2
# Groups:   id [5]
     id cat  
  <dbl> <chr>
1     1 A    
2     2 A    
3     3 A    
4     4 B    
5     5 B 

When several other letters are possible:

df <- data.frame(id = c(1,2,3,3,4,5,6,6,6,7),
                 cat= c("A","A","A","B","B","B", "A", "B", "C","D"))
df %>%
  group_by(id) %>%
  filter(! (n() >= 2 & cat %in% LETTERS[2:26]))
# A tibble: 7 x 2
# Groups:   id [7]
     id cat  
  <dbl> <chr>
1     1 A    
2     2 A    
3     3 A    
4     4 B    
5     5 B    
6     6 A    
7     7 D 

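Explanation: n() gives the size of the current group. Within any id that has more than one row, the rows whose cat is not "A" are dropped ("B" in the first example, any of "B" to "Z" in the second). Ids with a single row are kept unchanged.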

CodePudding user response:

One approach with dplyr. After grouping by id, filter where there is only one row per id or cat is "A".

library(dplyr)

df %>%
  group_by(id) %>%
  filter(n() == 1 | cat == "A")

Output

     id cat  
  <dbl> <chr>
1     1 A    
2     2 A    
3     3 A    
4     4 B    
5     5 B 

Also, if the same cat can be repeated within a single id, filter on the number of distinct cat values instead: keep a row if its id has only one distinct cat, or if its cat is "A":

df %>%
  group_by(id) %>%
  filter(n_distinct(cat) == 1 | cat == "A")
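
To illustrate the difference between the two filters, here is a small sketch on a hypothetical data frame df_rep (not from the question) in which id 4 has the same category "B" twice. The row-count filter drops id 4 entirely, while the n_distinct() filter keeps both of its rows.

```r
library(dplyr)

# Hypothetical data: id 4 repeats the same category "B"
df_rep <- data.frame(id  = c(1, 3, 3, 4, 4),
                     cat = c("A", "A", "B", "B", "B"))

# Keep rows from single-row ids, or rows with cat "A"
by_size <- df_rep %>%
  group_by(id) %>%
  filter(n() == 1 | cat == "A") %>%
  ungroup()

# Keep rows from ids with one distinct cat, or rows with cat "A"
by_distinct <- df_rep %>%
  group_by(id) %>%
  filter(n_distinct(cat) == 1 | cat == "A") %>%
  ungroup()

by_size      # ids 1 and 3 only
by_distinct  # ids 1, 3, and both rows of id 4
```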

CodePudding user response:

Using base R

subset(df, cat == 'A' | id %in% names(which(table(id) == 1)))
  id cat
1  1   A
2  2   A
3  3   A
5  4   B
6  5   B