Home > Mobile >  Count values in each column of R dataframe based on which range they're in
Count values in each column of R dataframe based on which range they're in

Time:07-21

I have a 1246 x 60,660 dataframe in R. A snippet of the data:

                     gene1             gene2              gene3
sample1          1615.7529         41.932474           697.9728
sample2           663.2001          8.602831          1198.1398
sample3          2406.1532         12.622443          1033.4625
sample4           836.3808         60.144235           259.3720
sample5          1217.8192         22.775497           695.9924
sample6           865.0344         15.350298           683.5397
sample7           935.3658         20.380676           540.6242
sample8           667.3883         56.939874          1056.6981

For each gene, I want each sample's value to be put into one of the following groups:

none = 0

ultra low = 1-4

low = 5 - 100

medium = 101 - 1000

high = 1000 & up

The end product would be another matrix/dataframe looking something like this:

                  gene1      gene2      gene3
none                  0          0          0 
ultra low             0          0          0
low                   0          8          0
medium                5          0          5
high                  3          0          3

How could I do this? After a bit of searching I think I'd probably end up using count or maybe aggregate? But I'm not sure exactly how to apply it to each column. Most of the examples I've seen were counting only one column.

CodePudding user response:

It may be easier with cut - i.e. specify the breaks and the corresponding labels in each of the columns by looping with lapply and get the frequency count with table

sapply(df1, \(x) table(cut(x, breaks = c(0, 1, 5, 101, 1001, Inf), 
   labels = c("none", "ultra low", "low", "medium", "high"))))

-output

           gene1 gene2 gene3
none          0     0     0
ultra low     0     0     0
low           0     8     0
medium        5     0     5
high          3     0     3

Or as @ZheyuanLi mentioned, tabulate could be faster

lbls <- c("none", "ultra low", "low", "medium", "high")
out <- sapply(df1, \(x) tabulate(cut(x, breaks = c(0, 1, 5, 101, 1001, Inf), 
   labels = lbls), nbins = length(lbls)))
row.names(out) <- lbls

-output

> out
          gene1 gene2 gene3
none          0     0     0
ultra low     0     0     0
low           0     8     0
medium        5     0     5
high          3     0     3

data

df1 <- structure(list(gene1 = c(1615.7529, 663.2001, 2406.1532, 836.3808, 
1217.8192, 865.0344, 935.3658, 667.3883), gene2 = c(41.932474, 
8.602831, 12.622443, 60.144235, 22.775497, 15.350298, 20.380676, 
56.939874), gene3 = c(697.9728, 1198.1398, 1033.4625, 259.372, 
695.9924, 683.5397, 540.6242, 1056.6981)), class = "data.frame", 
row.names = c("sample1", 
"sample2", "sample3", "sample4", "sample5", "sample6", "sample7", 
"sample8"))
  • Related