I have a large dataset with multiple rows and columns. I want to change the order of the columns by the amount of missing values in the column so that the variable with the most NA
s is the first column and the variable with the least NA
s is the last column.
So far I tried to use dplyr
's select
but did not get what I wanted.
df_ordered <- df %>%
select(order(is.na(df)))
CodePudding user response:
you do:
df <- df[order(-colSums(apply(df, 2, is.na)))]
or
df <- df[order(colSums(apply(df, 2, is.na)), decreasing = T)]
with dplyr is more verbose:
df <- df %>% relocate(
df %>% summarise(across(everything() , ~sum(is.na(.)))) %>%
unlist() %>% order(decreasing=T))
CodePudding user response:
Here is an option using dplyr
with colSums
, where we can get the sum of NA
s in each column, then sort that named numeric vector, then we return the names for the select
statement.
library(dplyr)
df %>%
select(names(sort(colSums(is.na(.)), decreasing = TRUE)))
Output
X10 X5 X6 X1 X2 X3 X4 X8 X9 X7
1 NA 0 NA NA 1 NA NA NA 1 NA
2 NA 1 NA NA 0 1 0 1 NA 0
3 0 NA NA 1 NA 1 1 1 0 1
4 NA NA 1 0 1 0 1 NA NA 1
5 NA NA 0 NA 0 0 0 1 0 1
6 1 NA NA 1 0 NA 1 0 1 1
7 1 1 1 1 0 0 1 0 0 1
8 NA 0 1 1 1 0 0 1 1 0
9 NA NA 0 0 NA 0 0 1 1 0
10 NA 1 NA 0 0 0 NA 1 0 1
Data
df <- structure(list(X1 = c(NA, NA, 1L, 0L, NA, 1L, 1L, 1L, 0L, 0L),
X2 = c(1L, 0L, NA, 1L, 0L, 0L, 0L, 1L, NA, 0L), X3 = c(NA,
1L, 1L, 0L, 0L, NA, 0L, 0L, 0L, 0L), X4 = c(NA, 0L, 1L, 1L,
0L, 1L, 1L, 0L, 0L, NA), X5 = c(0L, 1L, NA, NA, NA, NA, 1L,
0L, NA, 1L), X6 = c(NA, NA, NA, 1L, 0L, NA, 1L, 1L, 0L, NA
), X7 = c(NA, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L), X8 = c(NA,
1L, 1L, NA, 1L, 0L, 0L, 1L, 1L, 1L), X9 = c(1L, NA, 0L, NA,
0L, 1L, 0L, 1L, 1L, 0L), X10 = c(NA, NA, 0L, NA, NA, 1L,
1L, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-10L))
CodePudding user response:
Your code is on the right track, but you need to map across columns, and use mean
(or sum
) to collapse each column to a single number.
library(dplyr)
library(purrr)
dat <- data.frame(
x2 = c(1, NA, 2, NA),
x3 = c(NA, NA, NA, 4),
x1 = c(1, 2, NA, 4)
)
dat %>%
select(rev(order(map_dbl(., ~ mean(is.na(.x))))))
x3 x2 x1
1 NA 1 1
2 NA NA 2
3 NA 2 NA
4 4 NA 4
CodePudding user response:
Slight variation on answer from Ric Villalba going for brevity in base R and {dplyr}:
set.seed(1)
d <- data.frame(matrix(sample(c(1:5, NA), 49, T, prob = c(rep(1, 5), 5)), ncol = 7))
# base R
d[,order(-colSums(is.na(d)))]
#> X4 X2 X5 X1 X3 X6 X7
#> 1 NA 4 2 NA 5 4 5
#> 2 4 4 NA NA NA 5 3
#> 3 NA NA NA 3 5 NA 3
#> 4 NA NA 3 1 1 5 5
#> 5 NA NA NA NA NA NA NA
#> 6 NA 4 NA 2 5 2 NA
#> 7 NA NA 2 1 1 4 5
# with tidyverse
library(dplyr)
d %>%
select(order(-colSums(is.na(.))))
#> X4 X2 X5 X1 X3 X6 X7
#> 1 NA 4 2 NA 5 4 5
#> 2 4 4 NA NA NA 5 3
#> 3 NA NA NA 3 5 NA 3
#> 4 NA NA 3 1 1 5 5
#> 5 NA NA NA NA NA NA NA
#> 6 NA 4 NA 2 5 2 NA
#> 7 NA NA 2 1 1 4 5
Created on 2022-11-07 with reprex v2.0.2