Filter group to include column-wise condition with dplyr

Time:05-18

I would like to filter grouped data in dplyr so that only groups containing both levels of a categorical variable are kept. A sample of my data is given at the end of this post.

I would like my output to include only the health_facility groups that have both "malaria" and "non-malaria" present in their season column.

I have tried

multi_hf %>%
group_by(health_facility) %>%
filter(season == "malaria" & season == "non-malaria") 

However I get a tibble of only NA values.

Any help much appreciated! Data:

structure(list(season = c("malaria", "malaria", "malaria", "malaria", 
"malaria", "malaria", "malaria", "malaria", "malaria", "malaria", 
"malaria", "malaria", "malaria", "malaria", "malaria", "malaria", 
"malaria", "malaria", "malaria", "malaria", "malaria", "malaria", 
"malaria", "malaria", "malaria", "malaria", "non-malaria", "non-malaria", 
"non-malaria", "non-malaria", "non-malaria", "non-malaria", "non-malaria", 
"non-malaria", "non-malaria", "non-malaria", "non-malaria", "non-malaria", 
"non-malaria", "non-malaria", "non-malaria", "non-malaria", "non-malaria", 
"non-malaria", "non-malaria", "non-malaria", "non-malaria", "non-malaria", 
"non-malaria", "non-malaria", "non-malaria", "non-malaria", "non-malaria", 
"non-malaria", "non-malaria", "non-malaria", "non-malaria"), 
    health_facility = c("Hospital Agostinho Neto", "Hospital Baptista de Sousa", 
    "Health Delegation São Miguel", "Health Center Chã de Alecrim", 
    "Health Center Fonte Inês", "Health Delegation Maio", "Health Delegation Sao Vincente", 
    "Health Delegation Sao Vincente", "Hospital Ribeira Grande", 
    "Health Delegation Ribeira Brava", "Health Delegation Santa Cruz", 
    "Health Delegation Paul", "Center Delegation Santa Catarina", 
    "Regional Hospital Fogo e Brava", "Health Delegation São Filipe", 
    "Health Center Cidade Velha", "Health Delegation Tarrafal Santiago", 
    "Health Delegation Tarrafal Santiago", "Health Delegation Tarrafal Santiago", 
    "Health Center Sao Salvador do Mundo – Picos", "Health Delegation Tarrafal Santiago", 
    "Health Delegation São Lourenço dos Orgaos", "Health Delegation Ribeira Grande", 
    "Health Delegation of Praia", "Center Delegation Santa Catarina", 
    "Regional Hospital Santiago Norte", "Health Delegation Ribeira Brava", 
    "Health Delegation Ribeira Brava", "Hospital Baptista de Sousa", 
    "Health Delegation Paul", "Health Delegation Ribeira Brava", 
    "Health Center Sao Salvador do Mundo – Picos", "Health Delegation Sao Vincente", 
    "Health Delegation São Miguel", "Health Delegation Tarrafal Santiago", 
    "Regional Hospital Santiago Norte", "Regional Hospital Santiago Norte", 
    "Regional Hospital Santiago Norte", "Regional Hospital Santiago Norte", 
    "Health Delegation Sao Vincente", "Regional Hospital Fogo e Brava", 
    "Center Delegation Santa Catarina", "Health Center Chã de Alecrim", 
    "Hospital Agostinho Neto", "Hospital Ribeira Grande", "Health Delegation São Lourenço dos Orgaos", 
    "Health Delegation São Lourenço dos Orgaos", "Health Delegation São Filipe", 
    "Health Center Fonte Inês", "Hospital Agostinho Neto", "Regional Hospital Fogo e Brava", 
    "Health Delegation of Praia", "Health Delegation Maio", "Health Delegation Ribeira Grande", 
    "Health Delegation São Lourenço dos Orgaos", "Health Delegation Santa Cruz", 
    "Health Center Cidade Velha")), class = c("data.table", "data.frame"
), row.names = c(NA, -57L))

CodePudding user response:

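A concise option is n_distinct(), which counts the distinct season values within each group. It gives the intended result as long as season takes only the two values "malaria" and "non-malaria" (df below stands for the multi_hf data from the question):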

df %>%
  group_by(health_facility) %>%
  filter(n_distinct(season) == 2) %>%
  ungroup()
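
To illustrate the assumption behind the count check, here is a minimal sketch on made-up data. The names toy and by_count and the value "unknown" are hypothetical and do not come from the question. Facility "B" has two distinct season values, but neither pair member is "non-malaria", and it still passes n_distinct(season) == 2:

```r
library(dplyr)

# Hypothetical data: "A" has both seasons; "B" has two distinct values,
# but one of them is "unknown" rather than "non-malaria"
toy <- tibble(
  health_facility = c("A", "A", "B", "B"),
  season = c("malaria", "non-malaria", "malaria", "unknown")
)

by_count <- toy %>%
  group_by(health_facility) %>%
  filter(n_distinct(season) == 2) %>%
  ungroup()

# Both facilities pass the count check, including "B"
unique(by_count$health_facility)
```

If season can hold other values (or NA), testing for the two specific values is the safer choice.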

CodePudding user response:

filter(season == "malaria" & season == "non-malaria") keeps only rows where season equals both "malaria" and "non-malaria" at the same time, which is impossible because a row holds a single value. That is why you get 0 rows with the sample data you shared. The sample output contains no NA rows because the sample data contain no NA values. Comparing a missing value with == returns NA, whereas %in% returns FALSE, so %in% should help if your full data contain NA values.
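
The row-wise behaviour described above can be checked in base R. This is a small sketch on a made-up vector (the name season_vec and the names of the result vectors are only for illustration):

```r
# One value per element, plus a missing value
season_vec <- c("malaria", "non-malaria", NA)

# No single value can equal both strings, so this is never TRUE
both_at_once <- season_vec == "malaria" & season_vec == "non-malaria"
both_at_once
# FALSE FALSE NA

# == propagates NA, while %in% returns FALSE for a missing value
eq_result <- season_vec == "malaria"
in_result <- season_vec %in% "malaria"
eq_result
# TRUE FALSE NA
in_result
# TRUE FALSE FALSE
```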

You probably want to keep every health_facility that has both values, which can be done as follows:

library(dplyr)

multi_hf %>%
  arrange(health_facility) %>%
  group_by(health_facility) %>%
  filter(all(c("malaria", "non-malaria") %in% season)) %>%
  ungroup()
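
As a quick check of the group-wise filter, the following sketch applies the same condition to a small made-up table. The names toy and both and the facilities "A", "B" and "C" are hypothetical. Only "A" has rows for both seasons, so only its rows should remain:

```r
library(dplyr)

# Hypothetical data: "A" has both seasons, "B" and "C" have only one
toy <- tibble(
  health_facility = c("A", "A", "B", "B", "C"),
  season = c("malaria", "non-malaria", "malaria", "malaria", "non-malaria")
)

both <- toy %>%
  group_by(health_facility) %>%
  # TRUE for a group only if each of the two values appears in it
  filter(all(c("malaria", "non-malaria") %in% season)) %>%
  ungroup()

both
```

Because all() returns a single TRUE or FALSE per group, filter() keeps or drops each group as a whole.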