I am just starting to learn R in RStudio. I have a data set with variables v1 to v6 that represent different groups and contain the values 0 and 1, which stand for the answers no and yes. My question is: how many respondents have at least 3 missing responses across questions v1 to v6?
CodePudding user response:
You can count the values row by row in your data.frame.
Copy and paste all of the following code, seed included:
#1- Simulation data
set.seed(1)
values=c(0,1,NA)
df=data.frame(
v1=sample(values,10,TRUE),
v2=sample(values,10,TRUE),
v3=sample(values,10,TRUE),
v4=sample(values,10,TRUE),
v5=sample(values,10,TRUE),
v6=sample(values,10,TRUE)
)
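#2- Number of each value by row
#Count over v1..v6 only, so the count columns added below are not counted themselves
vars=paste0("v",1:6)
#Number of NA values by row
df$nbNA=apply(df[vars],1,function(x) sum(is.na(x)))
#Number of 0 values by row
df$nb0=apply(df[vars],1,function(x) sum(x==0,na.rm=TRUE))
#Number of 1 values by row
df$nb1=apply(df[vars],1,function(x) sum(x==1,na.rm=TRUE))
#3- Number of respondents with at least 3 missing responses
sum(df$nbNA>=3)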
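If you only need the missing-value count, a vectorized alternative to apply is rowSums on is.na. This is a minimal sketch on a small made-up data set: the toy values and the names toy, vars and n_missing are only for illustration, and it assumes missing answers are stored as NA.

```r
# Hypothetical toy data: 5 respondents, 6 questions coded 0/1, NA = missing
toy <- data.frame(
  v1 = c(1, NA, 0, NA, 1),
  v2 = c(0, NA, 1, NA, 1),
  v3 = c(NA, NA, 1, 0, 0),
  v4 = c(1, 0, NA, NA, 1),
  v5 = c(0, 1, 1, NA, 0),
  v6 = c(1, NA, 0, 1, 1)
)

# Restrict the count to the question columns
vars <- paste0("v", 1:6)

# is.na() gives a logical matrix; rowSums() counts the TRUEs in each row
n_missing <- rowSums(is.na(toy[vars]))
n_missing

# Number of respondents with at least 3 missing responses
sum(n_missing >= 3)
#> [1] 2
```

Selecting the columns by name keeps the count correct even if the data frame later gains extra columns such as an id.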
CodePudding user response:
Here is a solution in dplyr (part of the tidyverse), where the final output is a tibble with the number of missing responses for each individual who has at least 3 of them.
library(tidyverse)
# Set the seed so the random example data is reproducible
set.seed(4)
# Make some example data, I assume it looks something like this
data = tibble(
v1 = sample(x = c("no","yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
v2 = sample(x = c("no","yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
v3 = sample(x = c("no","yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
v4 = sample(x = c("no","yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
v5 = sample(x = c("no","yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
v6 = sample(x = c("no","yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
id = 1:100
)
data
#> # A tibble: 100 x 7
#> v1 v2 v3 v4 v5 v6 id
#> <chr> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 no yes no <NA> <NA> no 1
#> 2 yes no no no yes no 2
#> 3 yes yes no yes yes yes 3
#> 4 yes no yes yes no yes 4
#> 5 <NA> no yes <NA> yes yes 5
#> 6 yes no yes <NA> no <NA> 6
#> 7 no no no <NA> <NA> yes 7
#> 8 <NA> yes no <NA> <NA> yes 8
#> 9 <NA> <NA> no yes yes no 9
#> 10 yes <NA> yes <NA> yes yes 10
#> # ... with 90 more rows
# We then pivot the data into a long format
long_data = data %>%
  pivot_longer(cols = starts_with("v"), names_to = "group", values_to = "response")
long_data
#> # A tibble: 600 x 3
#> id group response
#> <int> <chr> <chr>
#> 1 1 v1 no
#> 2 1 v2 yes
#> 3 1 v3 no
#> 4 1 v4 <NA>
#> 5 1 v5 <NA>
#> 6 1 v6 no
#> 7 2 v1 yes
#> 8 2 v2 no
#> 9 2 v3 no
#> 10 2 v4 no
#> # ... with 590 more rows
# We then count the missing values for each individual and keep those with at least 3 (n > 2)
long_data %>%
  filter(is.na(response)) %>%
  group_by(id) %>%
  tally() %>%
  filter(n > 2)
#> # A tibble: 9 x 2
#> id n
#> <int> <int>
#> 1 8 3
#> 2 14 3
#> 3 19 3
#> 4 26 3
#> 5 36 3
#> 6 41 3
#> 7 49 3
#> 8 84 4
#> 9 90 3
Created on 2021-10-07 by the reprex package (v0.3.0)
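The pipeline above lists the respondents one per row. If a single number is enough, the count can also be taken without pivoting. This is only a sketch on a small made-up tibble in the same shape as the example data: toy, n_missing and n_at_least_3 are illustrative names, and it assumes missing answers are stored as NA.

```r
library(dplyr)

# Hypothetical toy data: "no"/"yes"/NA answers plus an id column
toy <- tibble(
  id = 1:4,
  v1 = c("no", NA, "yes", NA),
  v2 = c("yes", NA, "no", "no"),
  v3 = c(NA, NA, "no", NA),
  v4 = c("no", "yes", "yes", NA),
  v5 = c("yes", "no", NA, "yes"),
  v6 = c("no", NA, "yes", "no")
)

# Count the NAs per row across v1..v6 only (id is excluded)
toy$n_missing <- toy %>%
  select(v1:v6) %>%
  is.na() %>%
  rowSums()

# Number of respondents with at least 3 missing responses
n_at_least_3 <- toy %>%
  filter(n_missing >= 3) %>%
  nrow()
n_at_least_3
#> [1] 2
```

Unlike filtering on is.na(response) before tallying, this keeps respondents with zero missing answers in the table (with n_missing equal to 0), which can be handy if you later want the full distribution.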