Using RStudio, I have a table of raw data from DNA size distribution plots for hundreds of samples. The RFU (y) values are arranged in one column per sample, with the shared size (x) values in a separate column, as shown below.
[Image: example size distribution graph, for visualisation]
Example data (made-up values, just to show the format of the table):
sample001_rfu | sample002_rfu | sample003_rfu | size_bp |
---|---|---|---|
5678 | 4567 | 3456 | 1000 |
8901 | 7890 | 6789 | 5000 |
10234 | 10123 | 10010 | 10000 |
12356 | 12345 | 11234 | 15000 |
15678 | 14567 | 13445 | 20000 |
13890 | 16589 | 15624 | 25000 |
10987 | 13425 | 17245 | 30000 |
8902 | 11323 | 15428 | 35000 |
6513 | 8919 | 12879 | 40000 |
4178 | 6528 | 10256 | 45000 |
3213 | 4380 | 8621 | 50000 |
I am trying to find the maximum y value (RFU) for each sample (i.e. the maximum of each column) and report the corresponding x value (size), which will be used for planning downstream automated sample processing.
So, in the table above:
- sample001 = 20000bp (max rfu = 15678)
- sample002 = 25000bp (max rfu = 16589)
- sample003 = 30000bp (max rfu = 17245)
I have used the following to do this for one sample:
df$size_bp[which.max(df$sample001_rfu)]
However, I cannot find a way to repeat this for every sample_rfu column in the table without manually replacing the sample ID in the code above. I would then like to store these values with their sample IDs (column headers) as a list, which will later be compared against different processing thresholds.
Any suggestions would be greatly appreciated!
Answer:
base R
dat$size_bp[ sapply(dat[,-4], which.max) ]
# [1] 20000 25000 30000
## named
setNames(dat$size_bp[ sapply(dat[,-4], which.max) ], names(dat[,-4]))
# sample001_rfu sample002_rfu sample003_rfu
# 20000 25000 30000
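The question also asks for the result to be stored as a list and later compared against processing thresholds. A minimal sketch of that step, building on the named vector above: the columns are selected by name instead of by position, and the cutoff `min_peak_bp` is a hypothetical value chosen only for illustration, not something from the question.

```r
# Example data copied from the question's table
dat <- data.frame(
  sample001_rfu = c(5678, 8901, 10234, 12356, 15678, 13890, 10987, 8902, 6513, 4178, 3213),
  sample002_rfu = c(4567, 7890, 10123, 12345, 14567, 16589, 13425, 11323, 8919, 6528, 4380),
  sample003_rfu = c(3456, 6789, 10010, 11234, 13445, 15624, 17245, 15428, 12879, 10256, 8621),
  size_bp       = c(1000, seq(5000, 50000, by = 5000))
)

# Select the sample columns by name rather than by position
rfu_cols <- grep("_rfu$", names(dat), value = TRUE)

# Size (bp) at the row where each sample's RFU peaks
peak_bp <- vapply(dat[rfu_cols], function(y) dat$size_bp[which.max(y)], numeric(1))

# Store the result as a named list, one element per sample
peak_list <- as.list(peak_bp)

# Hypothetical threshold, for illustration only
min_peak_bp <- 25000
passes <- peak_bp >= min_peak_bp
print(passes)
```

Selecting by name avoids the hard-coded `-4`, which would silently pick the wrong columns if `size_bp` were not the fourth column.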
dplyr
library(dplyr)
dat %>%
  summarize(across(-size_bp, ~ size_bp[which.max(.)]))
# sample001_rfu sample002_rfu sample003_rfu
# 1 20000 25000 30000
data.table
library(data.table)
DT <- as.data.table(dat) # setDT is the preferred/canonical method
DT[, lapply(.SD, function(z) size_bp[which.max(z)]), .SDcols = patterns("^sample")]
# sample001_rfu sample002_rfu sample003_rfu
# <int> <int> <int>
# 1: 20000 25000 30000
Data
dat <- structure(list(sample001_rfu = c(5678L, 8901L, 10234L, 12356L, 15678L, 13890L, 10987L, 8902L, 6513L, 4178L, 3213L), sample002_rfu = c(4567L, 7890L, 10123L, 12345L, 14567L, 16589L, 13425L, 11323L, 8919L, 6528L, 4380L), sample003_rfu = c(3456L, 6789L, 10010L, 11234L, 13445L, 15624L, 17245L, 15428L, 12879L, 10256L, 8621L), size_bp = c(1000L, 5000L, 10000L, 15000L, 20000L, 25000L, 30000L, 35000L, 40000L, 45000L, 50000L)), class = "data.frame", row.names = c(NA, -11L))
Answer:
Here's another base R method:
samp_cols = names(df)[startsWith(names(df), "sample")]
result = lapply(samp_cols, function(x){
  mx = which.max(df[[x]])
  list(sample = x, max_rfu = df[mx, x], bp = df[mx, "size_bp"])
})
do.call(rbind, result)
# sample max_rfu bp
# [1,] "sample001_rfu" 15678 20000
# [2,] "sample002_rfu" 16589 25000
# [3,] "sample003_rfu" 17245 30000
Using this data:
df = read.table(text = 'sample001_rfu sample002_rfu sample003_rfu size_bp
5678 4567 3456 1000
8901 7890 6789 5000
10234 10123 10010 10000
12356 12345 11234 15000
15678 14567 13445 20000
13890 16589 15624 25000
10987 13425 17245 30000
8902 11323 15428 35000
6513 8919 12879 40000
4178 6528 10256 45000
3213 4380 8621 50000', header = TRUE)
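One thing to be aware of with the approach above: calling `rbind` on plain lists produces a matrix whose cells are list elements, so `$` column access and numeric comparisons do not work on it directly. A possible variation, sketched here as an alternative and not as part of the original answer, returns a one-row data frame per sample so that the combined result has ordinary typed columns. The name `peaks` is introduced only for this example.

```r
# Example data copied from the question's table
df <- data.frame(
  sample001_rfu = c(5678, 8901, 10234, 12356, 15678, 13890, 10987, 8902, 6513, 4178, 3213),
  sample002_rfu = c(4567, 7890, 10123, 12345, 14567, 16589, 13425, 11323, 8919, 6528, 4380),
  sample003_rfu = c(3456, 6789, 10010, 11234, 13445, 15624, 17245, 15428, 12879, 10256, 8621),
  size_bp       = c(1000, seq(5000, 50000, by = 5000))
)

samp_cols <- names(df)[startsWith(names(df), "sample")]

# Build a one-row data frame per sample instead of a plain list
rows <- lapply(samp_cols, function(x) {
  mx <- which.max(df[[x]])  # row index of the peak RFU
  data.frame(sample = x, max_rfu = df[[x]][mx], bp = df$size_bp[mx])
})

# Stack the rows into a single data frame with atomic columns
peaks <- do.call(rbind, rows)
str(peaks)
```

With this shape, a later threshold check can be written as a plain vectorised comparison on `peaks$bp`.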
Answer:
Here is another tidyverse approach:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-size_bp) %>%
  group_by(name) %>%
  slice_max(value, n = 1)
#   size_bp name          value
#     <int> <chr>         <int>
# 1   20000 sample001_rfu 15678
# 2   25000 sample002_rfu 16589
# 3   30000 sample003_rfu 17245
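A caveat that may matter for real data: `slice_max()` keeps tied rows by default (`with_ties = TRUE`), whereas `which.max()` returns only the first maximum. The example data has no ties, so the sketch below uses a small hypothetical table (`toy`, made up for illustration) in which one sample has a tied maximum, and passes `with_ties = FALSE` to get exactly one row per sample.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-sample table; sampleA has a tied maximum (50 appears twice)
toy <- data.frame(
  sampleA_rfu = c(10, 50, 50, 20),
  sampleB_rfu = c(5, 15, 40, 30),
  size_bp     = c(1000, 5000, 10000, 15000)
)

long <- toy %>%
  pivot_longer(-size_bp, names_to = "sample", values_to = "rfu") %>%
  group_by(sample)

# Default behaviour: tied rows are all kept, so sampleA contributes two rows
with_ties <- long %>% slice_max(rfu, n = 1) %>% ungroup()

# with_ties = FALSE: exactly one row per sample
peaks <- long %>% slice_max(rfu, n = 1, with_ties = FALSE) %>% ungroup()

print(peaks)
```

Whether a tie should be reported as one size or several is a processing decision; this only shows how to make the choice explicit.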