Home > Blockchain >  In grouped dataframe, summarize rows which contain certain value (e.g. zeros only) in set of columns
In grouped dataframe, summarize rows which contain certain value (e.g. zeros only) in set of columns

Time:10-01

Given the following dataframe:

dframe <- structure(list(id = c("294361-7349174-75411122", "294365-7645230-95464222", 
"291915-7345264-75464222", "291365-7345074-75164202", "594165-7345274-78444212", 
"234385-7335274-75464229", "734515-1345274-95464892", "201365-8345274-78464232", 
"294365-7315971-75464120", "591365-7345374-75464222", "394365-7345204-75411022", 
"494305-7345273-75464222", "291161-7345271-75461210", "294035-7345201-75464292", 
"298365-7345279-78864223", "294365-7345274-15964293", "294395-7345274-69464299", 
"899965-1345294-95464222", "194365-7145274-75464222", "194361-7349231-75464222", 
"294365-7345274-75464122", "191315-1345274-13464322", "794365-7349274-75464292", 
"214365-8318274-75464222", "394363-8341274-39494929"), gene = structure(c(3L, 
3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("ABC_1", "C_1", "XYZ_123"
), class = "factor"), group = structure(c(2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L), .Label = c("KO", "WT"), class = "factor"), class_A = c(0, 
1, 0, 2, 1, 0, 0, 1, 0, 1, 0, 0, 0, 2, 2, 1, 0, 0, 0, 0, 1, 1, 
1, 0, 3), class_B = c(0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 
1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1)), row.names = c(NA, -25L), class = "data.frame")

I would like to produce a new dataframe for data grouped per "group" and "gene". I want to output sum of rows per group in which both class_A and class_B columns contain the same, desired value - now I am interested in zeros.

Based on the answers provided in this thread: Efficient way to create a dataframe with multiple summary columns based on a grouped dataframe using dplyr in R

I can achieve this with following code:

desired_dframe <- dframe %>% 
  group_by(group, gene) %>%
  summarise(counts_zero = sum(ifelse((class_A == 0 & class_B == 0), 1, 0))) 

However, the above approach has one pitfall: the column names are hardcoded. In real life, I have dataframes with various number of columns denoting classes (and other names, e.g. "class_C", "class_Z" etc.). The common part of their names, is "class_". Based on this, I would like to consider all of the columns of interest.

I was playing around with rowSums(dplyr::across(dplyr::starts_with('class_')==0)) to achieve this, yet with no avail. The function throws the error and I have no idea how to debug it.

Also, I was trying to incorporate this column into the @akrun's answer provided here: Efficient way to create a dataframe with multiple summary columns based on a grouped dataframe using dplyr in R

On the @akrun's request, I am putting this into the new thread.

CodePudding user response:

If it is to get the sum of class_ columns, use across or if_all (more correct) i.e. loop over the class_ columns in if_all, apply the condition .x ==0, which returns TRUE only if all the columns looped for that rows will be 0 or else it return FALSE. Do the sum directly on the logical vector (TRUE -> 1 and FALSE -> 0)

library(dplyr)
dframe %>% 
  group_by(group, gene) %>% 
  summarise(counts_zero = sum(if_all(starts_with('class_'),
     ~ .x == 0)), .groups = 'drop')

-output

# A tibble: 5 × 3
  group gene    counts_zero
  <fct> <fct>         <int>
1 KO    ABC_1             2
2 KO    XYZ_123           4
3 WT    ABC_1             2
4 WT    C_1               0
5 WT    XYZ_123           1
  • Related