How to count the `NA` values in columns whose names contain a specific string and display each count

Let's say I have a bunch of columns, and many of them contain "start_time" in their column names. How do I count the number of NA values in each of those columns and display the counts separately (NOT as a sum of all NA values found)?

example output:

abc_start_time
## 5

xyz_start_time
## 1

ggg_start_time_end
## 0

or something similar.

CodePudding user response：

Use grep() to identify the desired columns, is.na() to flag the missing values, and colSums() to sum the resulting TRUE/FALSE values per column:

colSums(is.na(df[grep("start_time", names(df))]))

#     abc_start_time     xyz_start_time ggg_start_time_end 
#                 5                  2                  0

The part is.na(df[grep("start_time", names(df))]) returns a logical matrix (TRUE/FALSE) covering all the columns with "start_time" in the name. colSums() then sums each column, counting TRUE as 1 and FALSE as 0.

Data

df <- data.frame(abc_start_time = seq.Date(as.Date("2023/01/01"), as.Date("2023/01/30"), by = "day"),
                 xyz_start_time = seq.Date(as.Date("2023/01/01"), as.Date("2023/01/30"), by = "day"),
                 ggg_start_time_end = seq.Date(as.Date("2023/01/01"), as.Date("2023/01/30"), by = "day"),
                 another_column = seq.Date(as.Date("2023/01/01"), as.Date("2023/01/30"), by = "day"))
df[c(1,3,5:7), 1] <- NA
df[c(6,7), 2] <- NA
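To make the explanation above concrete, here is a minimal sketch that splits the one-liner into its three steps. The intermediate names idx, na_matrix and na_counts are only illustrative, and fixed = TRUE is an optional addition (not in the answer above) so that the pattern is matched literally rather than as a regular expression.

```r
# Same toy data as above, rebuilt so the snippet runs on its own
dates <- seq.Date(as.Date("2023-01-01"), as.Date("2023-01-30"), by = "day")
df <- data.frame(abc_start_time = dates,
                 xyz_start_time = dates,
                 ggg_start_time_end = dates,
                 another_column = dates)
df[c(1, 3, 5:7), 1] <- NA
df[c(6, 7), 2] <- NA

# Step 1: positions of the columns whose names contain "start_time"
# (fixed = TRUE is an assumption added here: literal match, no regex)
idx <- grep("start_time", names(df), fixed = TRUE)

# Step 2: logical matrix, TRUE wherever a value is missing
na_matrix <- is.na(df[idx])

# Step 3: column-wise sum, where each TRUE counts as 1
na_counts <- colSums(na_matrix)
na_counts
```

The result is a named numeric vector, so a single count can be pulled out with na_counts[["abc_start_time"]].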

CodePudding user response：

You could do:

lapply(df[grep("start_time", names(df))], function(x) sum(is.na(x)))
#> $abc_start_time
#> [1] 1
#> 
#> $def_start_time
#> [1] 2

Data used

df <- data.frame(abc_start_time = c(NA, 1, 2),
                 def_start_time = c(NA, NA, 3),
                 abc_end_time = c(NA, 2, 2))
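If a named vector is more convenient than a list, a possible variant (a sketch, not part of the answer above) is vapply(), which also checks that every column yields exactly one integer. The names cols and na_counts are only illustrative.

```r
df <- data.frame(abc_start_time = c(NA, 1, 2),
                 def_start_time = c(NA, NA, 3),
                 abc_end_time = c(NA, 2, 2))

# value = TRUE returns the matching column names instead of their positions
cols <- grep("start_time", names(df), value = TRUE)

# sum() over a logical vector returns an integer, hence integer(1) as the template
na_counts <- vapply(df[cols], function(x) sum(is.na(x)), integer(1))
na_counts
```

vapply() keeps the column names, so the counts stay labelled just as they do in the list returned by lapply().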

CodePudding user response：

Using dplyr

library(dplyr)
df %>% 
  summarise(across(contains('start_time'), ~ sum(is.na(.x))))
#   abc_start_time def_start_time
# 1              1              2

CodePudding user response：

We could use map_df() from the purrr package:

library(dplyr)
library(purrr)

df %>% 
  select(contains("start_time")) %>% 
  map_df(~sum(is.na(.)))

#>   abc_start_time def_start_time
#>            <int>          <int>
#> 1              1              2
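
All of the approaches above return the counts in one object. If the exact layout from the question is wanted (column name, then "## count"), one possible base R sketch is to format the named vector with sprintf() and print it with cat(). The variable names here are only illustrative.

```r
df <- data.frame(abc_start_time = c(NA, 1, 2),
                 def_start_time = c(NA, NA, 3),
                 abc_end_time = c(NA, 2, 2))

na_counts <- colSums(is.na(df[grep("start_time", names(df))]))

# One block per column: the column name, then "## <count>" on the next line
out <- sprintf("%s\n## %d\n", names(na_counts), as.integer(na_counts))
cat(out, sep = "\n")
```

Because sprintf() is vectorised over the names and the counts, no explicit loop is needed.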