I would like to remove NA of a variable within certain conditions of another variable, but can't figure out how the grouping / filtering within the expression needs to work.
I would like an entire trial to be removed if there are >= 3 NA values in the nosetip column for any of the phase_bins. So if nosetip for baseline, stim_bin1, stim_bin2 or recovery has too many NA values, I need the trial to be dropped.
I have tried
clean_df<- df %>%
group_by(ID, cond_f) %>%
filter(phase_bins== "stim_bin1") %>%
subset(!is.na(nosetip) >=3 )
but I would like to have all phase_bins filtered and not do this 4 times over. How can I filter for several conditions and add them to one expression?
My data looks like this:
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = c("UK103", "UK104", "UK105", "UK106", "UK107",
"UK108", "UK110", "UK111", "UK112", "UK113", "UK114", "UK115",
"UK116", "UK117", "UK119", "UK122", "UK123", "UK126", "UK130",
"UK132", "UK135", "UK136", "UK138", "UK139", "UK140", "UK147",
"UK148", "UK150", "UK153", "UK155", "UK159", "UK160", "UK162",
"UK163", "UK164", "UKA102", "UKA103", "UKA104", "UKA105", "UKA106",
"UKA107", "UKA108", "UKA109", "UKA110", "UKA111", "UKA112", "UKA113",
"UKA114", "UKA115", "UKA116", "UKA117", "UKA119", "UKA120", "UKA121",
"UKA122"), class = "factor"), cond_f = structure(c(4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("artificial", "babble",
"cry", "laugh"), class = "factor"), trial = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1", "2", "3", "4"
), class = "factor"), nosetip = c(31.4, 31.6, 32, 31.6, 30.5,
31.5, 31.2, 31.3, 30.5, 31.1), phase_bins = structure(c(4L, 4L,
5L, 2L, 3L, 2L, 2L, 3L, 3L, 3L), .Label = c("pre", "baseline",
"stim_bin1", "stim_bin2", "recovery", "break"), class = "factor")), row.names = c(NA,
-10L), groups = structure(list(ID = structure(c(1L, 1L, 1L, 1L
), .Label = c("UK103", "UK104", "UK105", "UK106", "UK107", "UK108",
"UK110", "UK111", "UK112", "UK113", "UK114", "UK115", "UK116",
"UK117", "UK119", "UK122", "UK123", "UK126", "UK130", "UK132",
"UK135", "UK136", "UK138", "UK139", "UK140", "UK147", "UK148",
"UK150", "UK153", "UK155", "UK159", "UK160", "UK162", "UK163",
"UK164", "UKA102", "UKA103", "UKA104", "UKA105", "UKA106", "UKA107",
"UKA108", "UKA109", "UKA110", "UKA111", "UKA112", "UKA113", "UKA114",
"UKA115", "UKA116", "UKA117", "UKA119", "UKA120", "UKA121", "UKA122"
), class = "factor"), trial = structure(c(1L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4"), class = "factor"), cond_f = structure(c(4L, 4L,
4L, 4L), .Label = c("artificial", "babble", "cry", "laugh"), class = "factor"),
phase_bins = structure(2:5, .Label = c("pre", "baseline",
"stim_bin1", "stim_bin2", "recovery", "break"), class = "factor"),
.rows = structure(list(c(4L, 6L, 7L), c(5L, 8L, 9L, 10L),
1:2, 3L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Any ideas would be much appreciated! Thanks!
CodePudding user response:
You could probably reduce this code down further if you needed but I think this is more legible.
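library(dplyr)

df %>%
  # count the NAs in nosetip within each phase of each trial
  group_by(ID, cond_f, trial, phase_bins) %>%
  mutate(n_na = sum(is.na(nosetip))) %>%
  # keep a trial only if its worst phase has fewer than 3 NAs
  group_by(ID, cond_f, trial) %>%
  filter(max(n_na) < 3) %>%
  ungroup() %>%
  select(-n_na)
This counts how many NAs there are in nosetip per phase, takes the maximum of those counts within each trial, and keeps only the trials where that maximum is below 3. Note that the NAs have to be counted in nosetip (not phase_bins), and trial has to be part of the grouping so that single trials are dropped rather than a whole ID / cond_f combination.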
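Since the posted sample contains no NA values, the idea of counting NAs per phase and then filtering on the per-trial maximum can be checked on a small made-up data set. This is only a sketch: the toy tibble and its values are invented for illustration, it reuses the column names from the question, and it assumes a trial is identified by ID, cond_f and trial.

```r
library(dplyr)

# Hypothetical toy data: two trials with two phases of four samples each.
toy <- tibble(
  ID         = "UK103",
  cond_f     = "laugh",
  trial      = rep(c("1", "2"), each = 8),
  phase_bins = rep(rep(c("baseline", "stim_bin1"), each = 4), times = 2),
  nosetip    = c(31.4, 31.6, 32.0, 31.6,  # trial 1, baseline:  0 NA
                 30.5, NA,   31.2, 31.3,  # trial 1, stim_bin1: 1 NA
                 31.1, 30.9, 31.0, 31.2,  # trial 2, baseline:  0 NA
                 NA,   NA,   NA,   30.5)  # trial 2, stim_bin1: 3 NA
)

clean <- toy %>%
  # count NAs in nosetip within each phase of each trial
  group_by(ID, cond_f, trial, phase_bins) %>%
  mutate(n_na = sum(is.na(nosetip))) %>%
  # drop the whole trial if any of its phases has 3 or more NAs
  group_by(ID, cond_f, trial) %>%
  filter(max(n_na) < 3) %>%
  ungroup() %>%
  select(-n_na)

unique(clean$trial)  # "1"
```

Trial 2 is removed entirely because its stim_bin1 phase has three NAs, while trial 1 keeps all eight rows, including the one with a single NA.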
CodePudding user response:
I am not sure if this is what you want, because your data has no NA values:
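library(dplyr)

df %>%
  # find the phases (within each trial) that have 3 or more NAs in nosetip
  group_by(ID, cond_f, trial, phase_bins) %>%
  filter(sum(is.na(nosetip)) >= 3) %>%
  ungroup() %>%
  # reduce to the trials those phases belong to ...
  distinct(ID, cond_f, trial) %>%
  # ... and drop those trials from the original data
  anti_join(df, ., by = c("ID", "cond_f", "trial"))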