Let's say I have a bunch of columns, and many of them contain "start_time" in their column names. How do I count the number of NA values in each of those columns and display the counts separately (NOT as a single sum of all NA values found)?
Example output:
abc_start_time
## 5
xyz_start_time
## 1
ggg_start_time_end
## 0
or something similar.
CodePudding user response:
Use colSums() to sum the TRUE/FALSE values, with grep() to identify all the desired columns:
colSums(is.na(df[grep("start_time", names(df))]))
# abc_start_time xyz_start_time ggg_start_time_end
# 5 2 0
The part is.na(df[grep("start_time", names(df))]) returns a logical matrix (TRUE/FALSE) covering every column with "start_time" in its name. The colSums() part then sums each column, counting TRUE as 1 and FALSE as 0.
Data
df <- data.frame(abc_start_time = seq.Date(as.Date("2023/01/01"), as.Date("2023/01/30"), by = "day"),
                 xyz_start_time = seq.Date(as.Date("2023/01/01"), as.Date("2023/01/30"), by = "day"),
                 ggg_start_time_end = seq.Date(as.Date("2023/01/01"), as.Date("2023/01/30"), by = "day"),
                 another_column = seq.Date(as.Date("2023/01/01"), as.Date("2023/01/30"), by = "day"))
df[c(1,3,5:7), 1] <- NA
df[c(6,7), 2] <- NA
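To make the explanation above concrete, here is a minimal sketch that splits the one-liner into its three steps. The data frame `toy` and the intermediate names `idx`, `na_mat`, and `na_counts` are made up for illustration, and `fixed = TRUE` is an optional choice here so the pattern is matched literally rather than as a regular expression.

```r
# Hypothetical toy data: two columns that match, one that does not
toy <- data.frame(
  abc_start_time = c(NA, 1, 2),
  def_start_time = c(NA, NA, 3),
  abc_end_time   = c(NA, 2, 2)
)

# Step 1: positions of the columns whose names contain "start_time"
idx <- grep("start_time", names(toy), fixed = TRUE)

# Step 2: logical matrix, TRUE wherever a value is missing
na_mat <- is.na(toy[idx])

# Step 3: column-wise sum; TRUE counts as 1 and FALSE as 0
na_counts <- colSums(na_mat)
print(na_counts)
```

The result is a named numeric vector, so a single count can be pulled out by name, for example na_counts[["def_start_time"]].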
CodePudding user response:
You could do:
lapply(df[grep("start_time", names(df))], function(x) sum(is.na(x)))
#> $abc_start_time
#> [1] 1
#>
#> $def_start_time
#> [1] 2
Data used
df <- data.frame(abc_start_time = c(NA, 1, 2),
                 def_start_time = c(NA, NA, 3),
                 abc_end_time = c(NA, 2, 2))
CodePudding user response:
Using dplyr
library(dplyr)
df %>%
summarise(across(contains('start_time'), ~ sum(is.na(.x))))
abc_start_time def_start_time
1 1 2
CodePudding user response:
We could use map_df() from the purrr package:
library(dplyr)
library(purrr)
df %>%
select(contains("start_time")) %>%
map_df(~sum(is.na(.)))
abc_start_time def_start_time
<int> <int>
1 1 2