Filtering out groups that only have one type of value in R [duplicate]-CodePudding

I am trying to filter a out groups within a dataframe that only have one type of value associated with it. This is pretty simple I imagine. Here is my dataframe:-

example<-structure(list(UserID = c("AAA", "AAA", "AAA", "AAA", "AAA", 
                                   "AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB", "BBB", 
                                   "BBB", "BBB", "BBB", "CCC", "CCC", "CCC", "CCC", "CCC", "CCC", 
                                   "CCC", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", 
                                   "DDD", "DDD", "DDD"), Status = c("Cluster 1", "Cluster 1", "Cluster 1", 
                                                                    "NotActive", "NotActive", "Cluster 1", "Cluster 2", "Cluster 2", 
                                                                    "Cluster 2", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive", "NotActive", "Cluster 1", "Cluster 1", 
                                                                    "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive")), row.names = c(NA, -35L), class = c("tbl_df", 
                                                                                                                                   "tbl", "data.frame"))

Basically, I want to filter out groups that only have one status called "NotActive". Some users have a mixture of "Cluster _" and "NotActive", I would like to keep these. I have a large dataset with thousands of groups that I need to filter out, so it is not as simple as filtering out UserID BBB and DDD like in the example, so something that can be applied at a larger scale. UserID's like AAA and CCC, that have a mixture of values which include "NotActive", should be kept, only the users with exclusively have "NotActive" as a Status

Any pointers would be great :)

CodePudding user response：

You can use any to keep only the groups that have at least one value which is not 'NotActive'.

In dplyr, you can use -

library(dplyr)
example %>%  group_by(UserID) %>% filter(any(Status != 'NotActive'))

#   UserID Status   
#   <chr>  <chr>    
# 1 AAA    Cluster 1
# 2 AAA    Cluster 1
# 3 AAA    Cluster 1
# 4 AAA    NotActive
# 5 AAA    NotActive
# 6 AAA    Cluster 1
# 7 AAA    Cluster 2
# 8 AAA    Cluster 2
# 9 AAA    Cluster 2
#10 CCC    NotActive
#11 CCC    NotActive
#12 CCC    NotActive
#13 CCC    NotActive
#14 CCC    Cluster 1
#15 CCC    Cluster 1
#16 CCC    NotActive

The same in base R and data.table.

#Base R
subset(example, ave(Status != 'NotActive', UserID, FUN = any))


#data.table
library(data.table)
setDT(example)[, .SD[any(Status != 'NotActive')], UserID]

CodePudding user response：

Here is a way. Group by UserID and test the number of rows to see if it's greater than 1 or if any status is diferent from not active.

library(dplyr)

example %>% 
  group_by(UserID) %>%
  filter(n() > 1 | any(Status != "NotActive"))

CodePudding user response：

I don't know a pure dplyr solution, but if the tibble is not too large to apply table, this should do the trick. This solution does not keep UserIDs with more than a specific status (e.g. "NotActive"), but with more than 1 status, regardless which these are:

Get data

example<-structure(list(UserID = c("AAA", "AAA", "AAA", "AAA", "AAA", 
                                   "AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB", "BBB", 
                                   "BBB", "BBB", "BBB", "CCC", "CCC", "CCC", "CCC", "CCC", "CCC", 
                                   "CCC", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", 
                                   "DDD", "DDD", "DDD"), Status = c("Cluster 1", "Cluster 1", "Cluster 1", 
                                                                    "NotActive", "NotActive", "Cluster 1", "Cluster 2", "Cluster 2", 
                                                                    "Cluster 2", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive", "NotActive", "Cluster 1", "Cluster 1", 
                                                                    "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive", "NotActive", "NotActive", "NotActive", 
                                                                    "NotActive", "NotActive")), row.names = c(NA, -35L), class = c("tbl_df", 
                                                                                                                                   "tbl", "data.frame"))


example
#> # A tibble: 35 × 2
#>    UserID Status   
#>    <chr>  <chr>    
#>  1 AAA    Cluster 1
#>  2 AAA    Cluster 1
#>  3 AAA    Cluster 1
#>  4 AAA    NotActive
#>  5 AAA    NotActive
#>  6 AAA    Cluster 1
#>  7 AAA    Cluster 2
#>  8 AAA    Cluster 2
#>  9 AAA    Cluster 2
#> 10 BBB    NotActive
#> # … with 25 more rows

filter rows

library(dplyr)

num_status <- apply(table(example$UserID, example$Status), 1, function(x) length(x[x>0]))

example %>% filter(UserID %in% names(which(num_status > 1)))

#> # A tibble: 16 × 2
#>    UserID Status   
#>    <chr>  <chr>    
#>  1 AAA    Cluster 1
#>  2 AAA    Cluster 1
#>  3 AAA    Cluster 1
#>  4 AAA    NotActive
#>  5 AAA    NotActive
#>  6 AAA    Cluster 1
#>  7 AAA    Cluster 2
#>  8 AAA    Cluster 2
#>  9 AAA    Cluster 2
#> 10 CCC    NotActive
#> 11 CCC    NotActive
#> 12 CCC    NotActive
#> 13 CCC    NotActive
#> 14 CCC    Cluster 1
#> 15 CCC    Cluster 1
#> 16 CCC    NotActive

^{Created on 2021-09-20 by the reprex package (v2.0.1)}