I am trying to filter a out groups within a dataframe that only have one type of value associated with it. This is pretty simple I imagine. Here is my dataframe:-
example<-structure(list(UserID = c("AAA", "AAA", "AAA", "AAA", "AAA",
"AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB", "BBB",
"BBB", "BBB", "BBB", "CCC", "CCC", "CCC", "CCC", "CCC", "CCC",
"CCC", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD",
"DDD", "DDD", "DDD"), Status = c("Cluster 1", "Cluster 1", "Cluster 1",
"NotActive", "NotActive", "Cluster 1", "Cluster 2", "Cluster 2",
"Cluster 2", "NotActive", "NotActive", "NotActive", "NotActive",
"NotActive", "NotActive", "NotActive", "NotActive", "NotActive",
"NotActive", "NotActive", "NotActive", "Cluster 1", "Cluster 1",
"NotActive", "NotActive", "NotActive", "NotActive", "NotActive",
"NotActive", "NotActive", "NotActive", "NotActive", "NotActive",
"NotActive", "NotActive")), row.names = c(NA, -35L), class = c("tbl_df",
"tbl", "data.frame"))
Basically, I want to filter out groups that only have one status called "NotActive". Some users have a mixture of "Cluster _" and "NotActive", I would like to keep these.
I have a large dataset with thousands of groups that I need to filter out, so it is not as simple as filtering out UserID
BBB and DDD like in the example, so something that can be applied at a larger scale. UserID's like AAA and CCC, that have a mixture of values which include "NotActive", should be kept, only the users with exclusively have "NotActive" as a Status
Any pointers would be great :)
CodePudding user response:
You can use any
to keep only the groups that have at least one value which is not 'NotActive'
.
In dplyr
, you can use -
library(dplyr)
example %>% group_by(UserID) %>% filter(any(Status != 'NotActive'))
# UserID Status
# <chr> <chr>
# 1 AAA Cluster 1
# 2 AAA Cluster 1
# 3 AAA Cluster 1
# 4 AAA NotActive
# 5 AAA NotActive
# 6 AAA Cluster 1
# 7 AAA Cluster 2
# 8 AAA Cluster 2
# 9 AAA Cluster 2
#10 CCC NotActive
#11 CCC NotActive
#12 CCC NotActive
#13 CCC NotActive
#14 CCC Cluster 1
#15 CCC Cluster 1
#16 CCC NotActive
The same in base R and data.table
.
#Base R
subset(example, ave(Status != 'NotActive', UserID, FUN = any))
#data.table
library(data.table)
setDT(example)[, .SD[any(Status != 'NotActive')], UserID]
CodePudding user response:
Here is a way. Group by UserID
and test the number of rows to see if it's greater than 1 or if any status is diferent from not active.
library(dplyr)
example %>%
group_by(UserID) %>%
filter(n() > 1 | any(Status != "NotActive"))
CodePudding user response:
I don't know a pure dplyr
solution, but if the tibble
is not too large to apply table
, this should do the trick. This solution does not keep UserIDs with more than a specific status (e.g. "NotActive"), but with more than 1 status, regardless which these are:
Get data
example<-structure(list(UserID = c("AAA", "AAA", "AAA", "AAA", "AAA",
"AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB", "BBB", "BBB",
"BBB", "BBB", "BBB", "CCC", "CCC", "CCC", "CCC", "CCC", "CCC",
"CCC", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD", "DDD",
"DDD", "DDD", "DDD"), Status = c("Cluster 1", "Cluster 1", "Cluster 1",
"NotActive", "NotActive", "Cluster 1", "Cluster 2", "Cluster 2",
"Cluster 2", "NotActive", "NotActive", "NotActive", "NotActive",
"NotActive", "NotActive", "NotActive", "NotActive", "NotActive",
"NotActive", "NotActive", "NotActive", "Cluster 1", "Cluster 1",
"NotActive", "NotActive", "NotActive", "NotActive", "NotActive",
"NotActive", "NotActive", "NotActive", "NotActive", "NotActive",
"NotActive", "NotActive")), row.names = c(NA, -35L), class = c("tbl_df",
"tbl", "data.frame"))
example
#> # A tibble: 35 × 2
#> UserID Status
#> <chr> <chr>
#> 1 AAA Cluster 1
#> 2 AAA Cluster 1
#> 3 AAA Cluster 1
#> 4 AAA NotActive
#> 5 AAA NotActive
#> 6 AAA Cluster 1
#> 7 AAA Cluster 2
#> 8 AAA Cluster 2
#> 9 AAA Cluster 2
#> 10 BBB NotActive
#> # … with 25 more rows
filter rows
library(dplyr)
num_status <- apply(table(example$UserID, example$Status), 1, function(x) length(x[x>0]))
example %>% filter(UserID %in% names(which(num_status > 1)))
#> # A tibble: 16 × 2
#> UserID Status
#> <chr> <chr>
#> 1 AAA Cluster 1
#> 2 AAA Cluster 1
#> 3 AAA Cluster 1
#> 4 AAA NotActive
#> 5 AAA NotActive
#> 6 AAA Cluster 1
#> 7 AAA Cluster 2
#> 8 AAA Cluster 2
#> 9 AAA Cluster 2
#> 10 CCC NotActive
#> 11 CCC NotActive
#> 12 CCC NotActive
#> 13 CCC NotActive
#> 14 CCC Cluster 1
#> 15 CCC Cluster 1
#> 16 CCC NotActive
Created on 2021-09-20 by the reprex package (v2.0.1)