Home > Software engineering >  remove rows with >50% missing across certain columns in R
remove rows with >50% missing across certain columns in R

Time:10-14

here is my data:

data <- data.frame(id=c(1,2,3,4,5),
                   ethnicity=c("asian",NA,NA,NA,"asian"),
                   age=c(34,NA,NA,NA,65),
                   a1=c(3,4,5,2,7),
                   a2=c("y","y","y",NA,NA),
                   a3=c("low", NA, "high", "med", NA),
                   a4=c("green", NA, "blue", "orange", NA))


  id ethnicity age a1   a2   a3     a4
   1     asian  34  3    y  low  green
   2      <NA>  NA  4    y <NA>   <NA>
   3      <NA>  NA  5    y high   blue
   4      <NA>  NA  2 <NA>  med orange
   5     asian  65  7 <NA> <NA>   <NA>

I would like to remove rows that have >50% missing in columns a1 to a4.

I have tried the below code; but am having trouble specifying the columns that I want this to take effect for:

data[which(rowMeans(!is.na(data)) > 0.5), ] #This doesn't specify the column

miss2 <- c()
for(i in 1:nrow(data)) {
  if(length(which(is.na(data[4:7,]))) >= 0.5*ncol(data)) miss2 <- append(miss2,4:7) 
}
data1 <- data[-miss2,]

#I thought I specified the column here but im not getting the output I was hoping for (i.e id 4 doesn't show up)

The code above looks at missing in all columns. I would like to specify to just look for % of missing in columns a1,a2,a3,a4. What im hoping to get is below:

  id ethnicity age a1   a2   a3     a4
   1     asian  34  3    y  low  green
   2      <NA>  NA  4    y <NA>   <NA>
   3      <NA>  NA  5    y high   blue
   4      <NA>  NA  2 <NA>  med orange

Any help is appreciated, thank you!

CodePudding user response:

You're really close, the main issue being using which (an array of indices) instead of simply an array of booleans

keep <- rowMeans(is.na(data[,4:7])) <= 0.5

keep
[1]  TRUE  TRUE  TRUE  TRUE FALSE

data[keep,]
  id ethnicity age a1   a2   a3     a4
1  1     asian  34  3    y  low  green
2  2      <NA>  NA  4    y <NA>   <NA>
3  3      <NA>  NA  5    y high   blue
4  4      <NA>  NA  2 <NA>  med orange

CodePudding user response:

Just for fun a dplyr approach: Here we combine rowwise with a comparing statement directly in filter. First we check the sum of NA over a1:a4, divide by the amount of columns and ask if condition <= 0.5 is true:

To do this we have to transform all (a1:a4) to the same class:

data %>% 
  rowwise() %>% 
  mutate(a1 = as.character(a1)) %>% 
  filter(sum(is.na(c_across(a1:a4))) / length(c_across(a1:a4)) <= 0.5)
     id ethnicity   age a1    a2    a3    a4    
  <dbl> <chr>     <dbl> <chr> <chr> <chr> <chr> 
1     1 asian        34 3     y     low   green 
2     2 NA           NA 4     y     NA    NA    
3     3 NA           NA 5     y     high  blue  
4     4 NA           NA 2     NA    med   orange

CodePudding user response:

data[rowSums(is.na(data[, -c(1:3)])) / 4  <= .5, ]
#>   id ethnicity age a1   a2   a3     a4
#> 1  1     asian  34  3    y  low  green
#> 2  2      <NA>  NA  4    y <NA>   <NA>
#> 3  3      <NA>  NA  5    y high   blue
#> 4  4      <NA>  NA  2 <NA>  med orange

CodePudding user response:

data[apply(is.na(data[, paste0('a', 1:4)]), 1, mean) <= .5, ]

Takes the appropriate columns, identifies NAs, computes proportions per row and selects rows where it is 0.5 or less.

  • Related