Given the data frame
grp x1 x2 y1 y2
1 A NA NA NA 1
2 A NA 1 1 1
3 A NA 2 3 3
4 A 2 NA 4 4
5 A 3 1 3 1
6 B 1 3 2 3
7 B 2 4 NA 4
8 B 3 3 2 NA
9 B 4 1 1 1
10 B 4 3 3 4
I want to count the NAs for each group (grp) in the different subsets of columns:
grp x y
1 A 5 1
2 B 0 2
What I already tried is
x_names <- c('x1', 'x2')
y_names <- c('y1', 'y2')
col_groups <- list(x_names, y_names)
df <- data.frame(grp = rep(c('A', 'B'), each = 5),
x1 = sample(x = 1:4, size = 10, replace = TRUE),
x2 = sample(x = 1:4, size = 10, replace = TRUE),
y1 = sample(x = 1:4, size = 10, replace = TRUE),
y2 = sample(x = 1:4, size = 10, replace = TRUE))
df[1,'x1'] <- NA
df[1,'x2'] <- NA
df[1,'y1'] <- NA
df[2,'x1'] <- NA
df[3,'x1'] <- NA
df[4,'x2'] <- NA
df[7,'y1'] <- NA
df[8,'y2'] <- NA
df <- group_by(df, grp)
df <- nest(.data = df, x = x_names, y = y_names)
df <- summarise_all(df,~ sum(is.na(.))) # works if not nested
df
Update:
My real column names don't have a specific pattern like x1, x2, ... This is why their names are stored in vectors.
If would be great if there is a solution applying this to all vectors stored in col_groups
Instead of absolute frequencies, relative frequencies would also be nice, but not neccesary!
CodePudding user response:
Here is a solution using tidyverse
library(tidyverse)
df %>%
split(.$grp) %>%
map(function(check_na){
bind_cols(
grp = unique(check_na[["grp"]]),
'x' = check_na %>% select(starts_with('x')) %>% summarise_all(~ sum(is.na(.))) %>% rowSums(),
'y' = check_na %>% select(starts_with('y')) %>% summarise_all(~ sum(is.na(.))) %>% rowSums(),
)
}) %>%
bind_rows()
Which gives us:
# A tibble: 2 × 3
grp x y
<chr> <dbl> <dbl>
1 A 5 1
2 B 0 2
CodePudding user response:
It can be helpful to reshape data to answer questions like these. For example
library(tidyverse)
df %>%
pivot_longer(-grp, names_pattern="(\\w )(\\d )", names_to=c("cat", "obs")) %>%
group_by(grp, cat) %>%
summarize(n=sum(is.na(value))) %>%
pivot_wider(grp, names_from=cat, values_from=n)
First we make the data long, do the summarization, and then reshape to wide.
CodePudding user response:
alternative tidy verse approach
df |>
pivot_longer(cols = -grp) |>
group_by(grp) |>
summarise(x = sum(is.na(value) & str_detect(name, "x")), y = sum(is.na(value) & str_detect(name, "y")))
pivot longer allows us to check for x and y's in one column
CodePudding user response:
sum
mation of NAs in by
.
by(dat, dat$grp, \(s) c(grp=el(s$grp), sapply(c('x', 'y'), \(n) sum(is.na(s[startsWith(names(s), n)]))))) |>
do.call(what=rbind) |> as.data.frame()
# grp x y
# A A 5 1
# B B 0 2
CodePudding user response:
A base R option using aggregate
reshape
aggregate(
cbind(x,y) ~ grp,
reshape(
cbind(
id = 1:nrow(df),
setNames(df, gsub("(\\d)", ".\\1", names(df)))
),
direction = "long",
idvar = c("id", "grp"),
varying = -c(1:2)
),
function(v) sum(is.na(v)),
na.action = na.pass
)
gives
grp x y
1 A 5 1
2 B 0 2