remove rows with >50% missing across certain columns in R-CodePudding

here is my data:

data <- data.frame(id=c(1,2,3,4,5),
                   ethnicity=c("asian",NA,NA,NA,"asian"),
                   age=c(34,NA,NA,NA,65),
                   a1=c(3,4,5,2,7),
                   a2=c("y","y","y",NA,NA),
                   a3=c("low", NA, "high", "med", NA),
                   a4=c("green", NA, "blue", "orange", NA))


  id ethnicity age a1   a2   a3     a4
   1     asian  34  3    y  low  green
   2      <NA>  NA  4    y <NA>   <NA>
   3      <NA>  NA  5    y high   blue
   4      <NA>  NA  2 <NA>  med orange
   5     asian  65  7 <NA> <NA>   <NA>

I would like to remove rows that have >50% missing in columns a1 to a4.

I have tried the below code; but am having trouble specifying the columns that I want this to take effect for:

data[which(rowMeans(!is.na(data)) > 0.5), ] #This doesn't specify the column

miss2 <- c()
for(i in 1:nrow(data)) {
  if(length(which(is.na(data[4:7,]))) >= 0.5*ncol(data)) miss2 <- append(miss2,4:7) 
}
data1 <- data[-miss2,]

#I thought I specified the column here but im not getting the output I was hoping for (i.e id 4 doesn't show up)

The code above looks at missing in all columns. I would like to specify to just look for % of missing in columns a1,a2,a3,a4. What im hoping to get is below:

  id ethnicity age a1   a2   a3     a4
   1     asian  34  3    y  low  green
   2      <NA>  NA  4    y <NA>   <NA>
   3      <NA>  NA  5    y high   blue
   4      <NA>  NA  2 <NA>  med orange

Any help is appreciated, thank you!

CodePudding user response：

You're really close, the main issue being using which (an array of indices) instead of simply an array of booleans

keep <- rowMeans(is.na(data[,4:7])) <= 0.5

keep
[1]  TRUE  TRUE  TRUE  TRUE FALSE

data[keep,]
  id ethnicity age a1   a2   a3     a4
1  1     asian  34  3    y  low  green
2  2      <NA>  NA  4    y <NA>   <NA>
3  3      <NA>  NA  5    y high   blue
4  4      <NA>  NA  2 <NA>  med orange

CodePudding user response：

Just for fun a dplyr approach: Here we combine rowwise with a comparing statement directly in filter. First we check the sum of NA over a1:a4, divide by the amount of columns and ask if condition <= 0.5 is true:

To do this we have to transform all (a1:a4) to the same class:

data %>% 
  rowwise() %>% 
  mutate(a1 = as.character(a1)) %>% 
  filter(sum(is.na(c_across(a1:a4))) / length(c_across(a1:a4)) <= 0.5)

     id ethnicity   age a1    a2    a3    a4    
  <dbl> <chr>     <dbl> <chr> <chr> <chr> <chr> 
1     1 asian        34 3     y     low   green 
2     2 NA           NA 4     y     NA    NA    
3     3 NA           NA 5     y     high  blue  
4     4 NA           NA 2     NA    med   orange

CodePudding user response：

data[rowSums(is.na(data[, -c(1:3)])) / 4  <= .5, ]
#>   id ethnicity age a1   a2   a3     a4
#> 1  1     asian  34  3    y  low  green
#> 2  2      <NA>  NA  4    y <NA>   <NA>
#> 3  3      <NA>  NA  5    y high   blue
#> 4  4      <NA>  NA  2 <NA>  med orange

CodePudding user response：

data[apply(is.na(data[, paste0('a', 1:4)]), 1, mean) <= .5, ]

Takes the appropriate columns, identifies NAs, computes proportions per row and selects rows where it is 0.5 or less.