Home > database >  Count non-null values in each column of a dataframe in R
Count non-null values in each column of a dataframe in R

Time:02-19

I am working with the following dataframe:

cell_a  cell_b  cell_c  group
N/A     2.5     5       A 
1.2     3.6     N/A     A
3       2.1     3.2     A
N/A     N/A     1       B
1.2     N/A     N/A     B
2       N/A     N/A     B  

I would like to count the number of rows with a non-null value in each column and summarise it by group.

The outcome should be stored in new dataframe, such as:

cell_a  cell_b  cell_c  group
2       3       2       A 
2       0       1       B 

I tried with:

df_2 <- aggregate(df[1:3], list(df$group), length)

but it's indeed giving me the total length of each rows for each group. I've also tried adding na.action = na.omit or na.rm=TRUE after length but doesn't work.

What else can I use in this code to ignore the N/A values?

Thank you very much in advance for your help!

CodePudding user response:

With aggregate, use sum of non-NA elements (assuming the missing value is NA) as length returns the total number of elements (per group as we are grouping by group)

aggregate(. ~ group, df, FUN = function(x) sum(!is.na(x)), na.action = NULL)

If the NA value is a string element "N/A"

aggregate(. ~ group, df, FUN = function(x) sum(x != "N/A"), na.action = NULL)
   group cell_a cell_b cell_c
1     A      2      3      2
2     B      2      0      1

data

df <- structure(list(cell_a = c("N/A", "1.2", "3", "N/A", "1.2", "2"
), cell_b = c("2.5", "3.6", "2.1", "N/A", "N/A", "N/A"), cell_c = c("5", 
"N/A", "3.2", "1", "N/A", "N/A"), group = c("A", "A", "A", "B", 
"B", "B")), class = "data.frame", row.names = c(NA, -6L))

CodePudding user response:

Here is how we could do it with dplyr:

  1. change N\A not NA with na_if and across
  2. group
  3. summarise with across
library(dplyr)

df %>% 
  mutate(across(starts_with("cell"), ~na_if(., "N/A"))) %>% 
  group_by(group) %>% 
  summarise(across(starts_with("cell"), ~sum(!is.na(.))))

 group cell_a cell_b cell_c
  <chr>  <int>  <int>  <int>
1 A          2      3      2
2 B          2      0      1
  • Related