How to find out how many respondents have at least 3 missing responses?

I am just starting to learn R with RStudio. I have a data set with variables v1 to v6, which represent different groups and contain the values 0 and 1 for the answers "no" and "yes". My question is: how many respondents have at least 3 missing responses across questions v1 to v6?

CodePudding user response：

You can count the missing values in each row of your data.frame.

Copy and paste all of the following code, including the seed, to reproduce the example.

# 1 - Simulated data
set.seed(1)
values <- c(0, 1, NA)
df <- data.frame(
  v1 = sample(values, 10, TRUE),
  v2 = sample(values, 10, TRUE),
  v3 = sample(values, 10, TRUE),
  v4 = sample(values, 10, TRUE),
  v5 = sample(values, 10, TRUE),
  v6 = sample(values, 10, TRUE)
)

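# 2 - Number of each value by row
# Restrict the counts to the question columns, so that the helper columns
# added below (nbNA, nb0, nb1) are not counted as responses themselves
vars <- paste0("v", 1:6)

# Number of NA values by row
df$nbNA <- rowSums(is.na(df[vars]))

# Number of 0 values by row
df$nb0 <- rowSums(df[vars] == 0, na.rm = TRUE)

# Number of 1 values by row
df$nb1 <- rowSums(df[vars] == 1, na.rm = TRUE)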
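
The per-row NA count still needs one more step to answer the original question: compare it with the threshold and count the rows that pass. Here is a minimal sketch of that step in base R. The data frame `answers` and its values are made up for illustration, and it assumes the question columns are named v1 to v6 as in the question.

```r
# Hypothetical toy data: 5 respondents, 0 = no, 1 = yes, NA = missing
answers <- data.frame(
  v1 = c(0,  NA, 1, NA, 0),
  v2 = c(1,  NA, 0, NA, NA),
  v3 = c(0,  NA, 1, NA, 1),
  v4 = c(NA, 1,  0, NA, 0),
  v5 = c(NA, 0,  1, 0,  1),
  v6 = c(0,  1,  0, NA, NA)
)

# Only look at the question columns
questions <- paste0("v", 1:6)

# Number of missing responses per respondent (one value per row)
n_missing <- rowSums(is.na(answers[questions]))

# How many respondents have at least 3 missing responses?
sum(n_missing >= 3)
# 2

# Which rows are they?
unname(which(n_missing >= 3))
# 2 4
```

`is.na()` on a data frame returns a logical matrix, so `rowSums()` counts the `TRUE` values in each row, and `sum()` on the comparison counts the respondents at or above the threshold.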

CodePudding user response：

Here is a solution using dplyr and tidyr (both part of the tidyverse). The final output is a tibble with the number of missing responses for each individual who has at least 3 of them.

library(tidyverse)

# Set the random seed so the example is reproducible
set.seed(4)

# Make some example data, I assume it looks something like this
data = tibble(
  v1 = sample(x = c("no","yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
  v2 = sample(x = c("no","yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
  v3 = sample(x = c("no","yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
  v4 = sample(x = c("no","yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
  v5 = sample(x = c("no","yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
  v6 = sample(x = c("no","yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
  id = 1:100
  )

data
#> # A tibble: 100 x 7
#>    v1    v2    v3    v4    v5    v6       id
#>    <chr> <chr> <chr> <chr> <chr> <chr> <int>
#>  1 no    yes   no    <NA>  <NA>  no        1
#>  2 yes   no    no    no    yes   no        2
#>  3 yes   yes   no    yes   yes   yes       3
#>  4 yes   no    yes   yes   no    yes       4
#>  5 <NA>  no    yes   <NA>  yes   yes       5
#>  6 yes   no    yes   <NA>  no    <NA>      6
#>  7 no    no    no    <NA>  <NA>  yes       7
#>  8 <NA>  yes   no    <NA>  <NA>  yes       8
#>  9 <NA>  <NA>  no    yes   yes   no        9
#> 10 yes   <NA>  yes   <NA>  yes   yes      10
#> # ... with 90 more rows

# We then pivot the data into a long format
long_data = data %>% 
  pivot_longer(cols = starts_with("v"), names_to = "group", values_to = "response")

long_data
#> # A tibble: 600 x 3
#>       id group response
#>    <int> <chr> <chr>   
#>  1     1 v1    no      
#>  2     1 v2    yes     
#>  3     1 v3    no      
#>  4     1 v4    <NA>    
#>  5     1 v5    <NA>    
#>  6     1 v6    no      
#>  7     2 v1    yes     
#>  8     2 v2    no      
#>  9     2 v3    no      
#> 10     2 v4    no      
#> # ... with 590 more rows


# We then count the missing values for each individual, and keep those with at least 3 (n > 2)
long_data %>% 
  filter(is.na(response)) %>% 
  group_by(id) %>% 
  tally() %>% 
  filter(n > 2)
#> # A tibble: 9 x 2
#>      id     n
#>   <int> <int>
#> 1     8     3
#> 2    14     3
#> 3    19     3
#> 4    26     3
#> 5    36     3
#> 6    41     3
#> 7    49     3
#> 8    84     4
#> 9    90     3

Created on 2021-10-07 by the reprex package (v0.3.0)
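
If only the number of respondents is needed, the filtered result can be reduced to a single count with `nrow()`. The sketch below is an alternative that skips the pivot and counts the missing values row by row with `rowwise()` and `c_across()` from dplyr (version 1.0.0 or later). The tibble `responses` and its values are made up for illustration, not taken from the data above.

```r
library(dplyr)

# Hypothetical toy data: 5 respondents, answers coded "yes"/"no", NA = missing
responses <- tibble(
  id = 1:5,
  v1 = c("no",  NA,    "yes", NA,   "no"),
  v2 = c("yes", NA,    "no",  NA,   NA),
  v3 = c("no",  NA,    "yes", NA,   "yes"),
  v4 = c(NA,    "yes", "no",  NA,   "no"),
  v5 = c(NA,    "no",  "yes", "no", "yes"),
  v6 = c("no",  "yes", "no",  NA,   NA)
)

# Count the missing responses per respondent, then keep those with at least 3
flagged <- responses %>%
  rowwise() %>%
  mutate(n_missing = sum(is.na(c_across(v1:v6)))) %>%
  ungroup() %>%
  filter(n_missing >= 3)

# Number of respondents with at least 3 missing responses
nrow(flagged)
# 2
```

Unlike the long-format approach, this keeps one row per respondent throughout, so respondents with no missing values still carry `n_missing = 0` until the final filter.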