Home > front end >  Count the number of NAs in multiple columns after grouping a dataframe in R
Count the number of NAs in multiple columns after grouping a dataframe in R

Time:08-06

I have a data frame df that is a date Date, a region Region, and an identifier ID followed by then a series of 12 variables (ex. V1 V2 ...) in columns:

col1 <- as.Date(c("01/01/01","01/02/01","01/03/01","01/01/01","01/02/01","01/03/01"))
col2 <- c(1,1,1,2,2,2)
col3 <- c(8,NA,NA,4,3,5)
col4 <- c(NA,NA,NA,5,3,NA)

df <- data.frame(col1,col2,col3,col4) 
colnames(df) <-c("Date", "ID", "V1", "V2")
df

        Date Region ID V1 V2
1 2022-01-01     R1  1  8 NA
2 2022-01-02     R1  1 NA NA
3 2022-01-03     R1  1 NA NA
4 2022-01-01     R1  2  4  5
5 2022-01-02     R1  2  3  3
6 2022-01-03     R1  2  5 NA

Essentially I want to create another data frame that calculates the number of NAs per variable, grouping by ID and Region (date can be ignored). I managed to get it for one variable but am having trouble adding the second through twelfth:

data.check <- as.data.frame(df %>% group_by(Region,ID) %>% summarise(sum(is.na(V1))))
data.check

  Region ID sum(is.na(V1))
1     R1  1              2
2     R1  2              0

I tried a few different formulas and techniques that I would usually use for adding columns but they do not incorporate the row grouping. The following are a few of the additional methods I tried:

#Successful for Variable 1
data.check <- as.data.frame(df %>% group_by(Region,ID) %>% summarise(sum(is.na(V1))))

#Tried adding further piping for the second variable - unsuccessful
data.check <- as.data.frame(df %>% group_by(Region,ID) %>% summarise(sum(is.na(V1))) %>% group_by(Region,ID) %>% summarise(sum(is.na(V2))))

#tried adding the piping to a second variable column - unsuccessful
data.check <- as.data.frame(df %>% group_by(Region,ID) %>% summarise(sum(is.na(V1))))
data.check$V2 <- as.data.frame(data.check %>% group_by(Region,ID) %>% summarise(sum(is.na(df$V2))))

#tried only adding the sum of NA but it does not group by row "ID and Region" values and adds the entire column total
data.check$V2 <- sum(is.na(df$V2))

Does anyone know how I can add the additional columns, counting the number of NAs in each variable for each ID? Apologies if there is an easy method I'm missing, I'm still pretty new to R. Thanks in advanced for your inputs!

CodePudding user response:

I propose two ways:
using dplyr:

df %>% 
  group_by(Region,ID) %>%
  summarise_each(list(na_count = ~sum(is.na(.))))

or data.table:

library(data.table)
setDT(df)[, lapply(.SD, function(x) sum(is.na(x))), by = .(Region, ID)]

CodePudding user response:

I really like this library called collapse

here is an example of how to do it in that library. It can be chained with dplyr if you need to go back to dplyr.

library(collapse)

df |> 
  fgroup_by(ID) |> 
  dapply(is.na) |> 
  fsum() |> 
  ungroup()

  ID Date V1 V2
1  1    0  2  3
2  2    0  0  1

In dplyr, the across function allows you to specify the columns, if you leave it empty, as I did and just left a comma, it summarizes everything


df |> 
  group_by(ID) |> 
  summarise(across(, ~ sum(is.na(.x))), .groups = "drop")
  • Related