Home > database >  Counting NAs in a data frame for several subsets
Counting NAs in a data frame for several subsets

Time:06-03

Given the data frame

   grp      x1    x2    y1    y2
 1 A        NA    NA    NA     1
 2 A        NA     1     1     1
 3 A        NA     2     3     3
 4 A         2    NA     4     4
 5 A         3     1     3     1
 6 B         1     3     2     3
 7 B         2     4    NA     4
 8 B         3     3     2    NA
 9 B         4     1     1     1
10 B         4     3     3     4

I want to count the NAs for each group (grp) in the different subsets of columns:

  grp       x     y
1 A         5     1
2 B         0     2

What I already tried is

x_names <- c('x1', 'x2')
y_names <- c('y1', 'y2')
col_groups <- list(x_names, y_names)
df <- data.frame(grp = rep(c('A', 'B'), each = 5),
                 x1 = sample(x = 1:4, size = 10, replace = TRUE),
                 x2 = sample(x = 1:4, size = 10, replace = TRUE),
                 y1 = sample(x = 1:4, size = 10, replace = TRUE),
                 y2 = sample(x = 1:4, size = 10, replace = TRUE))
df[1,'x1'] <- NA
df[1,'x2'] <- NA
df[1,'y1'] <- NA
df[2,'x1'] <- NA
df[3,'x1'] <- NA
df[4,'x2'] <- NA
df[7,'y1'] <- NA
df[8,'y2'] <- NA
df <- group_by(df, grp)
df <- nest(.data = df, x = x_names, y = y_names)
df <- summarise_all(df,~ sum(is.na(.))) # works if not nested
df

Update:

My real column names don't have a specific pattern like x1, x2, ... This is why their names are stored in vectors.

If would be great if there is a solution applying this to all vectors stored in col_groups

Instead of absolute frequencies, relative frequencies would also be nice, but not neccesary!

CodePudding user response:

Here is a solution using tidyverse

library(tidyverse)

df %>% 
  split(.$grp) %>% 
  map(function(check_na){
  bind_cols(
    grp = unique(check_na[["grp"]]),
    'x' = check_na %>% select(starts_with('x')) %>% summarise_all(~ sum(is.na(.))) %>% rowSums(),
    'y' = check_na %>% select(starts_with('y')) %>% summarise_all(~ sum(is.na(.))) %>% rowSums(),
  )
}) %>% 
  bind_rows() 

Which gives us:

# A tibble: 2 × 3
  grp       x     y
  <chr> <dbl> <dbl>
1 A         5     1
2 B         0     2

CodePudding user response:

It can be helpful to reshape data to answer questions like these. For example

library(tidyverse)
df %>% 
  pivot_longer(-grp, names_pattern="(\\w )(\\d )", names_to=c("cat", "obs")) %>% 
  group_by(grp, cat) %>% 
  summarize(n=sum(is.na(value))) %>% 
  pivot_wider(grp, names_from=cat, values_from=n)

First we make the data long, do the summarization, and then reshape to wide.

CodePudding user response:

alternative tidy verse approach

df |> 
  pivot_longer(cols = -grp) |> 
  group_by(grp) |> 
  summarise(x = sum(is.na(value) & str_detect(name, "x")), y = sum(is.na(value) & str_detect(name, "y")))

pivot longer allows us to check for x and y's in one column

CodePudding user response:

summation of NAs in by.

by(dat, dat$grp, \(s) c(grp=el(s$grp), sapply(c('x', 'y'), \(n) sum(is.na(s[startsWith(names(s), n)]))))) |>
  do.call(what=rbind) |> as.data.frame()
#   grp x y
# A   A 5 1
# B   B 0 2

CodePudding user response:

A base R option using aggregate reshape

aggregate(
    cbind(x,y) ~ grp,
    reshape(
        cbind(
            id = 1:nrow(df),
            setNames(df, gsub("(\\d)", ".\\1", names(df)))
        ),
        direction = "long",
        idvar = c("id", "grp"),
        varying = -c(1:2)
    ),
    function(v) sum(is.na(v)),
    na.action = na.pass
)

gives

  grp x y
1   A 5 1
2   B 0 2
  •  Tags:  
  • r
  • Related