Removing NA > x based on grouping condition of other variable

I would like to remove NA values of one variable depending on conditions in another variable, but I can't figure out how the grouping / filtering within the expression needs to work.

I would like an entire trial to be removed if there are >= 3 NA values in the nosetip column within any of the phase_bins. So if nosetip has too many NAs for baseline, stim_bin1, stim_bin2 or recovery, I need the trial to be dropped.

I have tried

clean_df<- df %>% 
  group_by(ID, cond_f) %>% 
  filter(phase_bins== "stim_bin1") %>% 
  subset(!is.na(nosetip) >=3 )

but I would like to have all phase_bins filtered and not do this 4 times over. How can I filter for several conditions and add them to one expression?

My data looks like this:

structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L), .Label = c("UK103", "UK104", "UK105", "UK106", "UK107", 
"UK108", "UK110", "UK111", "UK112", "UK113", "UK114", "UK115", 
"UK116", "UK117", "UK119", "UK122", "UK123", "UK126", "UK130", 
"UK132", "UK135", "UK136", "UK138", "UK139", "UK140", "UK147", 
"UK148", "UK150", "UK153", "UK155", "UK159", "UK160", "UK162", 
"UK163", "UK164", "UKA102", "UKA103", "UKA104", "UKA105", "UKA106", 
"UKA107", "UKA108", "UKA109", "UKA110", "UKA111", "UKA112", "UKA113", 
"UKA114", "UKA115", "UKA116", "UKA117", "UKA119", "UKA120", "UKA121", 
"UKA122"), class = "factor"), cond_f = structure(c(4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("artificial", "babble", 
"cry", "laugh"), class = "factor"), trial = structure(c(1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1", "2", "3", "4"
), class = "factor"), nosetip = c(31.4, 31.6, 32, 31.6, 30.5, 
31.5, 31.2, 31.3, 30.5, 31.1), phase_bins = structure(c(4L, 4L, 
5L, 2L, 3L, 2L, 2L, 3L, 3L, 3L), .Label = c("pre", "baseline", 
"stim_bin1", "stim_bin2", "recovery", "break"), class = "factor")), row.names = c(NA, 
-10L), groups = structure(list(ID = structure(c(1L, 1L, 1L, 1L
), .Label = c("UK103", "UK104", "UK105", "UK106", "UK107", "UK108", 
"UK110", "UK111", "UK112", "UK113", "UK114", "UK115", "UK116", 
"UK117", "UK119", "UK122", "UK123", "UK126", "UK130", "UK132", 
"UK135", "UK136", "UK138", "UK139", "UK140", "UK147", "UK148", 
"UK150", "UK153", "UK155", "UK159", "UK160", "UK162", "UK163", 
"UK164", "UKA102", "UKA103", "UKA104", "UKA105", "UKA106", "UKA107", 
"UKA108", "UKA109", "UKA110", "UKA111", "UKA112", "UKA113", "UKA114", 
"UKA115", "UKA116", "UKA117", "UKA119", "UKA120", "UKA121", "UKA122"
), class = "factor"), trial = structure(c(1L, 1L, 1L, 1L), .Label = c("1", 
"2", "3", "4"), class = "factor"), cond_f = structure(c(4L, 4L, 
4L, 4L), .Label = c("artificial", "babble", "cry", "laugh"), class = "factor"), 
    phase_bins = structure(2:5, .Label = c("pre", "baseline", 
    "stim_bin1", "stim_bin2", "recovery", "break"), class = "factor"), 
    .rows = structure(list(c(4L, 6L, 7L), c(5L, 8L, 9L, 10L), 
        1:2, 3L), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

Any ideas would be much appreciated! Thanks!

CodePudding user response：

You could probably reduce this code down further if you needed but I think this is more legible.

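library(dplyr)

df %>% 
  # count missing nosetip values per phase bin within each trial
  group_by(ID, cond_f, trial, phase_bins) %>% 
  mutate(n_na = sum(is.na(nosetip))) %>%
  # keep a trial only if none of its phase bins has 3 or more NAs
  group_by(ID, cond_f, trial) %>%
  filter(max(n_na) < 3) %>%
  ungroup() %>%
  select(-n_na)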

Count how many NAs there are per phase bin, then take the maximum of these counts and filter out the groups where it reaches 3.
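
Since the posted sample contains no NA, the counting idea is easier to check on a small made-up data set. The following is only a sketch: the `toy` tibble and its values are invented for illustration (they are not the asker's data), and it assumes the NAs should be counted in nosetip and that a trial is identified by ID, cond_f and trial.

```r
library(dplyr)

# Made-up toy data (an assumption, not the asker's data):
# one participant, one condition, two trials with two phase bins each.
# Trial 1 has 3 NAs in stim_bin1; trial 2 has a single NA in baseline.
toy <- tibble(
  ID         = "UK103",
  cond_f     = "laugh",
  trial      = rep(c("1", "2"), each = 8),
  phase_bins = rep(rep(c("baseline", "stim_bin1"), each = 4), times = 2),
  nosetip    = c(31.4, 31.6, 32.0, 31.6, NA,   NA,   NA,   30.5,
                 31.5, 31.2, NA,   31.3, 30.5, 31.1, 30.9, 31.0)
)

clean <- toy %>%
  # count missing nosetip values within each phase bin of each trial
  group_by(ID, cond_f, trial, phase_bins) %>%
  mutate(n_na = sum(is.na(nosetip))) %>%
  # keep a trial only if its worst phase bin has fewer than 3 NAs
  group_by(ID, cond_f, trial) %>%
  filter(max(n_na) < 3) %>%
  ungroup() %>%
  select(-n_na)

unique(clean$trial)
```

Here trial 1 is dropped as a whole because one of its phase bins reaches 3 NAs, while trial 2 is kept with all of its rows, including the one with a missing value.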

CodePudding user response：

I am not sure if this is what you want, because your data has no NA:

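library(dplyr)
df %>% 
  # find trials where any phase bin has 3 or more missing nosetip values
  group_by(ID, cond_f, trial, phase_bins) %>% 
  filter(sum(is.na(nosetip)) >= 3) %>%
  ungroup() %>% 
  distinct(ID, cond_f, trial) %>%
  # drop those trials from the original data
  anti_join(df, ., by = c("ID", "cond_f", "trial"))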