I am trying to produce a generic script which I can apply to multiple datasets to check for missing data, which may have been entered as NA, entered as 999 or as 0. Example:
df=read.table(text="Subject, var1, var2, var3, var4
A, 1, NA, 3, 999
B, 0, 5, 2, 1
C, 7, 6, NA, 7
D, NA, 1, 0, 2
E, NA, NA, NA, NA
F, 3, 4, 999, 4
G, 4, NA, 1, 5
H, 7, 0, 7, 999", header=TRUE, sep=",")
I would like too produce an end result that looks something like this:
var1 var2 var3 var4
NA 2 3 2 1
999 0 0 1 2
0 1 1 1 0
I can easily do this for the NA values using
df %>% summarise_at(2:5, funs(sum(is.na(.))))
with the result
var1 var2 var3 var4
1 2 3 2 1
and I now just want to do something similar for the 999 and 0 values. Also important to note is that a script which requires the entry of each individual column name won't be of great help as I have a lot of large datasets which need to be checked.
Thanks for your help!
CodePudding user response:
You may get the data in long format, keep only the values that you are interested in (0, 999, NA), count their occurrences for each column and get data in wide format.
library(dplyr)
library(tidyr)
missing_vals <- c(0, 999)
df %>%
pivot_longer(cols = -Subject) %>%
filter(value %in% missing_vals | is.na(value)) %>%
count(name, value) %>%
pivot_wider(names_from = name, values_from = n, values_fill = 0)
# value var1 var2 var3 var4
# <int> <int> <int> <int> <int>
#1 0 1 1 1 0
#2 NA 2 3 2 1
#3 999 0 0 1 2
This actually is also a good use-case to use the new reframe
function which will soon be available in dplyr
1.1.0. It avoids reshaping of data.
df %>%
reframe(value = c(NA, 999, 0),
across(2:5, ~c(sum(is.na(.)), sum(. %in% 999), sum(. %in% 0))))
# value var1 var2 var3 var4
#1 NA 2 3 2 1
#2 999 0 0 1 2
#3 0 1 1 1 0
data
The data shared has unwanted whitespace characters which can be avoided using strip.white = TRUE
in read.table
.
df=read.table(text="Subject, var1, var2, var3, var4
A, 1, NA, 3, 999
B, 0, 5, 2, 1
C, 7, 6, NA, 7
D, NA, 1, 0, 2
E, NA, NA, NA, NA
F, 3, 4, 999, 4
G, 4, NA, 1, 5
H, 7, 0, 7, 999", header=TRUE, sep=",", strip.white = TRUE)
CodePudding user response:
In base R:
rbind(`NA` = colSums(is.na(df[-1])),
`0` = colSums(df[-1] == 0, na.rm = TRUE),
`999` = colSums(df[-1] == 999, na.rm = TRUE))
var1 var2 var3 var4
NA 2 3 2 1
0 1 1 1 0
999 0 0 1 2
Or, somewhat generalized:
x <- c(0, 999)
rb <- rbind(`NA` = colSums(is.na(df[-1])),
t(sapply(x, \(x) colSums(df[-1] == x, na.rm = TRUE))))
rownames(rb)[-1] <- x