Create a summary table for occurrences of certain values for each column in R-CodePudding

I am trying to produce a generic script which I can apply to multiple datasets to check for missing data, which may have been entered as NA, entered as 999 or as 0. Example:


df=read.table(text="Subject, var1, var2, var3, var4
A, 1, NA, 3, 999 
B, 0, 5, 2, 1 
C, 7, 6, NA, 7 
D, NA, 1, 0, 2 
E, NA, NA, NA, NA
F, 3, 4, 999, 4 
G, 4, NA, 1, 5 
H, 7, 0, 7, 999", header=TRUE, sep=",")

I would like too produce an end result that looks something like this:


   var1 var2 var3 var4
NA    2    3    2    1
999   0    0    1    2
0     1    1    1    0

I can easily do this for the NA values using

df %>% summarise_at(2:5, funs(sum(is.na(.))))

with the result

  var1 var2 var3 var4
1    2    3    2    1

and I now just want to do something similar for the 999 and 0 values. Also important to note is that a script which requires the entry of each individual column name won't be of great help as I have a lot of large datasets which need to be checked.

Thanks for your help!

CodePudding user response：

You may get the data in long format, keep only the values that you are interested in (0, 999, NA), count their occurrences for each column and get data in wide format.

library(dplyr)
library(tidyr)

missing_vals <- c(0, 999)

df %>%
  pivot_longer(cols = -Subject) %>%
  filter(value %in% missing_vals | is.na(value)) %>%
  count(name, value) %>%
  pivot_wider(names_from = name, values_from = n, values_fill = 0)

#  value  var1  var2  var3  var4
#  <int> <int> <int> <int> <int>
#1     0     1     1     1     0
#2    NA     2     3     2     1
#3   999     0     0     1     2

This actually is also a good use-case to use the new reframe function which will soon be available in dplyr 1.1.0. It avoids reshaping of data.

df %>%
  reframe(value = c(NA, 999, 0), 
          across(2:5, ~c(sum(is.na(.)), sum(. %in% 999), sum(. %in% 0))))

#  value var1 var2 var3 var4
#1    NA    2    3    2    1
#2   999    0    0    1    2
#3     0    1    1    1    0

data

The data shared has unwanted whitespace characters which can be avoided using strip.white = TRUE in read.table.

df=read.table(text="Subject, var1, var2, var3, var4
A, 1, NA, 3, 999 
B, 0, 5, 2, 1 
C, 7, 6, NA, 7 
D, NA, 1, 0, 2 
E, NA, NA, NA, NA
F, 3, 4, 999, 4 
G, 4, NA, 1, 5 
H, 7, 0, 7, 999", header=TRUE, sep=",", strip.white = TRUE)

CodePudding user response：

In base R:

rbind(`NA` = colSums(is.na(df[-1])),
      `0` = colSums(df[-1] == 0, na.rm = TRUE),
      `999` = colSums(df[-1] == 999, na.rm = TRUE))

    var1 var2 var3 var4
NA     2    3    2    1
0      1    1    1    0
999    0    0    1    2

Or, somewhat generalized:

x <- c(0, 999)
rb <- rbind(`NA` = colSums(is.na(df[-1])),
      t(sapply(x, \(x) colSums(df[-1] == x, na.rm = TRUE))))
rownames(rb)[-1] <- x