How to change the order of the columns by the amount of NA in the column in R?-CodePudding

I have a large dataset with multiple rows and columns. I want to change the order of the columns by the amount of missing values in the column so that the variable with the most NAs is the first column and the variable with the least NAs is the last column.

So far I tried to use dplyr's select but did not get what I wanted.

df_ordered <- df %>% 
    select(order(is.na(df)))

CodePudding user response：

you do:

df <- df[order(-colSums(apply(df, 2, is.na)))]

df <- df[order(colSums(apply(df, 2, is.na)), decreasing = T)]

with dplyr is more verbose:

df <- df %>% relocate(
  df %>% summarise(across(everything() , ~sum(is.na(.)))) %>% 
    unlist() %>% order(decreasing=T))

CodePudding user response：

Here is an option using dplyr with colSums, where we can get the sum of NAs in each column, then sort that named numeric vector, then we return the names for the select statement.

library(dplyr)

df %>% 
  select(names(sort(colSums(is.na(.)), decreasing = TRUE)))

Output

   X10 X5 X6 X1 X2 X3 X4 X8 X9 X7
1   NA  0 NA NA  1 NA NA NA  1 NA
2   NA  1 NA NA  0  1  0  1 NA  0
3    0 NA NA  1 NA  1  1  1  0  1
4   NA NA  1  0  1  0  1 NA NA  1
5   NA NA  0 NA  0  0  0  1  0  1
6    1 NA NA  1  0 NA  1  0  1  1
7    1  1  1  1  0  0  1  0  0  1
8   NA  0  1  1  1  0  0  1  1  0
9   NA NA  0  0 NA  0  0  1  1  0
10  NA  1 NA  0  0  0 NA  1  0  1

Data

df <- structure(list(X1 = c(NA, NA, 1L, 0L, NA, 1L, 1L, 1L, 0L, 0L), 
    X2 = c(1L, 0L, NA, 1L, 0L, 0L, 0L, 1L, NA, 0L), X3 = c(NA, 
    1L, 1L, 0L, 0L, NA, 0L, 0L, 0L, 0L), X4 = c(NA, 0L, 1L, 1L, 
    0L, 1L, 1L, 0L, 0L, NA), X5 = c(0L, 1L, NA, NA, NA, NA, 1L, 
    0L, NA, 1L), X6 = c(NA, NA, NA, 1L, 0L, NA, 1L, 1L, 0L, NA
    ), X7 = c(NA, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L), X8 = c(NA, 
    1L, 1L, NA, 1L, 0L, 0L, 1L, 1L, 1L), X9 = c(1L, NA, 0L, NA, 
    0L, 1L, 0L, 1L, 1L, 0L), X10 = c(NA, NA, 0L, NA, NA, 1L, 
    1L, NA, NA, NA)), class = "data.frame", row.names = c(NA, 
-10L))

CodePudding user response：

Your code is on the right track, but you need to map across columns, and use mean (or sum) to collapse each column to a single number.

library(dplyr)
library(purrr)

dat <- data.frame(
  x2 = c(1, NA, 2, NA),
  x3 = c(NA, NA, NA, 4),
  x1 = c(1, 2, NA, 4)
)

dat %>%
  select(rev(order(map_dbl(., ~ mean(is.na(.x))))))

  x3 x2 x1
1 NA  1  1
2 NA NA  2
3 NA  2 NA
4  4 NA  4

CodePudding user response：

Slight variation on answer from Ric Villalba going for brevity in base R and {dplyr}:

set.seed(1)
d <- data.frame(matrix(sample(c(1:5, NA), 49, T, prob = c(rep(1, 5), 5)), ncol = 7))

# base R
d[,order(-colSums(is.na(d)))]
#>   X4 X2 X5 X1 X3 X6 X7
#> 1 NA  4  2 NA  5  4  5
#> 2  4  4 NA NA NA  5  3
#> 3 NA NA NA  3  5 NA  3
#> 4 NA NA  3  1  1  5  5
#> 5 NA NA NA NA NA NA NA
#> 6 NA  4 NA  2  5  2 NA
#> 7 NA NA  2  1  1  4  5

# with tidyverse
library(dplyr)

d %>% 
  select(order(-colSums(is.na(.))))
#>   X4 X2 X5 X1 X3 X6 X7
#> 1 NA  4  2 NA  5  4  5
#> 2  4  4 NA NA NA  5  3
#> 3 NA NA NA  3  5 NA  3
#> 4 NA NA  3  1  1  5  5
#> 5 NA NA NA NA NA NA NA
#> 6 NA  4 NA  2  5  2 NA
#> 7 NA NA  2  1  1  4  5

^{Created on 2022-11-07 with reprex v2.0.2}