I have a 1246 x 60,660 dataframe in R. A snippet of the data:
gene1 gene2 gene3
sample1 1615.7529 41.932474 697.9728
sample2 663.2001 8.602831 1198.1398
sample3 2406.1532 12.622443 1033.4625
sample4 836.3808 60.144235 259.3720
sample5 1217.8192 22.775497 695.9924
sample6 865.0344 15.350298 683.5397
sample7 935.3658 20.380676 540.6242
sample8 667.3883 56.939874 1056.6981
For each gene, I want each sample's value to be put into one of the following groups:
none = 0
ultra low = 1-4
low = 5 - 100
medium = 101 - 1000
high = 1000 & up
The end product would be another matrix/dataframe looking something like this:
gene1 gene2 gene3
none 0 0 0
ultra low 0 0 0
low 0 8 0
medium 5 0 5
high 3 0 3
How could I do this? After a bit of searching I think I'd probably end up using count
or maybe aggregate
? But I'm not sure exactly how to apply it to each column. Most of the examples I've seen were counting only one column.
CodePudding user response:
It may be easier with cut
- i.e. specify the breaks
and the corresponding labels
in each of the columns by looping with lapply
and get the frequency count with table
sapply(df1, \(x) table(cut(x, breaks = c(0, 1, 5, 101, 1001, Inf),
labels = c("none", "ultra low", "low", "medium", "high"))))
-output
gene1 gene2 gene3
none 0 0 0
ultra low 0 0 0
low 0 8 0
medium 5 0 5
high 3 0 3
Or as @ZheyuanLi mentioned, tabulate
could be faster
lbls <- c("none", "ultra low", "low", "medium", "high")
out <- sapply(df1, \(x) tabulate(cut(x, breaks = c(0, 1, 5, 101, 1001, Inf),
labels = lbls), nbins = length(lbls)))
row.names(out) <- lbls
-output
> out
gene1 gene2 gene3
none 0 0 0
ultra low 0 0 0
low 0 8 0
medium 5 0 5
high 3 0 3
data
df1 <- structure(list(gene1 = c(1615.7529, 663.2001, 2406.1532, 836.3808,
1217.8192, 865.0344, 935.3658, 667.3883), gene2 = c(41.932474,
8.602831, 12.622443, 60.144235, 22.775497, 15.350298, 20.380676,
56.939874), gene3 = c(697.9728, 1198.1398, 1033.4625, 259.372,
695.9924, 683.5397, 540.6242, 1056.6981)), class = "data.frame",
row.names = c("sample1",
"sample2", "sample3", "sample4", "sample5", "sample6", "sample7",
"sample8"))