I have a dataframe df
as below:
Input
id na_count task q1 q2 q3 q4 q5
7 3 a 1 NA NA 2 NA
7 1 b 1 0 0 NA 0
7 3 c NA NA 1 NA 1
9 0 a 1 1 0 2 1
9 1 b 1 0 0 1 NA
9 0 c 1 1 0 1 0
9 1 d 1 0 NA 1 1
3 3 a 1 NA NA 1 NA
3 1 b 1 1 NA 2 1
1 2 b 1 1 NA 1 NA
1 2 c 1 1 NA 1 NA
1 3 d NA NA 1 NA 1
2 4 a 1 NA NA NA NA
2 2 b 1 2 NA 1 NA
2 1 c 1 1 2 NA 2
2 1 d NA 1 3 3 3
2 0 e 2 2 3 3 4
I am interested in adding a binary column or flag
evidence
which is computed by looking at data perid
and then finding whether thatid
meets a minimum threshold of non-NA values.As an example, I have my minimum non-NA threshold set to 10. So if for any
id
there are at least 10 non-NA values (over multiple rows) then I want to set evidence toYes
, otherwise I want to set evidence toNo
(Prefered) If possible, I want to use the count of non-NA values from the column
na_count
rather than actually computing NAs over the columns q1:q5
Output
For the example with the threshold of 10 non-NA, my output would be as below:
id na_count task q1 q2 q3 q4 q5 evidence
7 3 a 1 NA NA 2 NA no
7 1 b 1 0 0 NA 0 no
7 3 c NA NA 1 NA 1 no
9 0 a 1 1 0 2 1 yes
9 1 b 1 0 0 1 NA yes
9 0 c 1 1 0 1 0 yes
9 1 d 1 0 NA 1 1 yes
3 3 a 1 NA NA 1 NA no
3 1 b 1 1 NA 2 1 no
1 2 b 1 1 NA 1 NA no
1 2 c 1 1 NA 1 NA no
1 3 d NA NA 1 NA 1 no
2 4 a 1 NA NA NA NA yes
2 2 b 1 2 NA 1 NA yes
2 1 c 1 1 2 NA 2 yes
2 1 d NA 1 3 3 3 yes
2 0 e 2 2 3 3 4 yes
Partial solution
I have tried the following, but it just counts the rows not the non-NA values over multiple rows for that id.
library(dplyr)
df = df %>%
group_by(id) %>%
mutate(rows = n())
Related Posts
The following posts are related but do not address my problem How to make n() do not count NAs too in tidyverse?, Taking a count() after group_by() for non-missing values and Count number of non-NA values by group
dput()
For coding, I am also copying the dput()
of the dataframe
# dput(df)
structure(list(
id = c(7L, 7L, 7L, 9L, 9L, 9L, 9L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
na_count = c(3L, 1L, 3L, 0L, 1L, 0L, 1L, 3L, 1L, 2L, 2L, 3L, 4L, 2L, 1L, 1L, 0L),
task = c("a", "b", "c", "a", "b", "c", "d", "a", "b", "b", "c", "d", "a", "b", "c", "d", "e"),
q1 = c(1L, 1L, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 1L, NA, 2L),
q2 = c(NA, 0L, NA, 1L, 0L, 1L, 0L, NA, 1L, 1L, 1L, NA, NA, 2L, 1L, 1L, 2L),
q3 = c(NA, 0L, 1L, 0L, 0L, 0L, NA, NA, NA, NA, NA, 1L, NA, NA, 2L, 3L, 3L),
q4 = c(2L, NA, NA, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, NA, NA, 1L, NA, 3L, 3L),
q5 = c(NA, 0L, 1L, 1L, NA, 0L, 1L, NA, 1L, NA, NA, 1L, NA, NA, 2L, 3L, 4L)),
row.names = c(NA, -17L), class = "data.frame")
Any help on this would be greatly appreciated, thanks!
CodePudding user response:
library(tidyverse)
threshold = 10
df %>% group_by(id) %>%
mutate(evidence = ifelse(n()*5 - sum(na_count) >= threshold, "yes", "no"))
The 5 comes from the number of columns you have, q1:q5.
CodePudding user response:
A solution using the tidyverse
package. We can define a helper function to count the non-NA values, nest the data frame, apply the function to each id, and then unnest the data frame.
library(tidyverse)
count_non_na <- function(x, threshold = 10){
x2 <- x %>%
dplyr::select(starts_with("q")) %>%
unlist()
non_na <- sum(!is.na(x2)) >= threshold
if (non_na){
result <- "yes"
} else {
result <- "no"
}
return(result)
}
df2 <- df %>%
group_by(id) %>%
nest() %>%
mutate(evidence = map_chr(data, count_non_na)) %>%
unnest(cols = data) %>%
ungroup()
df2
# # A tibble: 17 x 9
# id na_count task q1 q2 q3 q4 q5 evidence
# <int> <int> <chr> <int> <int> <int> <int> <int> <chr>
# 1 7 3 a 1 NA NA 2 NA no
# 2 7 1 b 1 0 0 NA 0 no
# 3 7 3 c NA NA 1 NA 1 no
# 4 9 0 a 1 1 0 2 1 yes
# 5 9 1 b 1 0 0 1 NA yes
# 6 9 0 c 1 1 0 1 0 yes
# 7 9 1 d 1 0 NA 1 1 yes
# 8 3 3 a 1 NA NA 1 NA no
# 9 3 1 b 1 1 NA 2 1 no
# 10 1 2 b 1 1 NA 1 NA no
# 11 1 2 c 1 1 NA 1 NA no
# 12 1 3 d NA NA 1 NA 1 no
# 13 2 4 a 1 NA NA NA NA yes
# 14 2 2 b 1 2 NA 1 NA yes
# 15 2 1 c 1 1 2 NA 2 yes
# 16 2 1 d NA 1 3 3 3 yes
# 17 2 0 e 2 2 3 3 4 yes
Here is another idea. This solution only needs the dplyr
package, not the entire tidyverse
package.
df3 <- df %>%
group_by(id) %>%
summarize(across(starts_with("q"), .fns = ~sum(!is.na(.)))) %>%
mutate(Total = rowSums(select(., starts_with("q")))) %>%
mutate(evidence = ifelse(Total >= 10, "yes", "no")) %>%
select(id, evidence) %>%
right_join(df, by = "id") %>%
relocate(evidence, .after = q5)
df3
# # A tibble: 17 x 9
# id na_count task q1 q2 q3 q4 q5 evidence
# <int> <int> <chr> <int> <int> <int> <int> <int> <chr>
# 1 1 2 b 1 1 NA 1 NA no
# 2 1 2 c 1 1 NA 1 NA no
# 3 1 3 d NA NA 1 NA 1 no
# 4 2 4 a 1 NA NA NA NA yes
# 5 2 2 b 1 2 NA 1 NA yes
# 6 2 1 c 1 1 2 NA 2 yes
# 7 2 1 d NA 1 3 3 3 yes
# 8 2 0 e 2 2 3 3 4 yes
# 9 3 3 a 1 NA NA 1 NA no
# 10 3 1 b 1 1 NA 2 1 no
# 11 7 3 a 1 NA NA 2 NA no
# 12 7 1 b 1 0 0 NA 0 no
# 13 7 3 c NA NA 1 NA 1 no
# 14 9 0 a 1 1 0 2 1 yes
# 15 9 1 b 1 0 0 1 NA yes
# 16 9 0 c 1 1 0 1 0 yes
# 17 9 1 d 1 0 NA 1 1 yes