How to find the maximum y value for each column and report the corresponding x value?

Using RStudio, I have a table of raw data from a DNA size distribution plot for hundreds of samples. The RFU readings (y values) are arranged in one column per sample, and the sizes (x values), which are shared by all samples, are in a separate column (see below).

[Image: example size distribution graph, for visualisation]

Example data (made-up values, just to show the format of the table):

sample001_rfu	sample002_rfu	sample003_rfu	size_bp
5678	4567	3456	1000
8901	7890	6789	5000
10234	10123	10010	10000
12356	12345	11234	15000
15678	14567	13445	20000
13890	16589	15624	25000
10987	13425	17245	30000
8902	11323	15428	35000
6513	8919	12879	40000
4178	6528	10256	45000
3213	4380	8621	50000

I am trying to find the maximum y value (RFU) for each sample (i.e. the max value in each column) and report the corresponding x value (size), which will be used for downstream automated sample processing planning.

So, in the table above:

sample001 = 20000bp (max rfu = 15678)
sample002 = 25000bp (max rfu = 16589)
sample003 = 30000bp (max rfu = 17245)

I have used the following to do this for one sample:

df$size_bp[which.max(df$sample001_rfu)]

However, I cannot seem to find a solution to repeat this for each sample_rfu (column) in the table without manually replacing the sample id in the code above. I would then like to store these values and their sample IDs (column header) as a list which will later be compared against different processing thresholds.

Any suggestions would be greatly appreciated!

CodePudding user response：

base R

dat$size_bp[ sapply(dat[,-4], which.max) ]
# [1] 20000 25000 30000

## named
setNames(dat$size_bp[ sapply(dat[,-4], which.max) ], names(dat[,-4]))
# sample001_rfu sample002_rfu sample003_rfu 
#         20000         25000         30000
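
Since the question asks for the peak sizes to be stored with their sample IDs and later compared against processing thresholds, the named vector above could be taken one step further. This is only a sketch: the names `rfu_cols`, `peak_bp`, `peak_list` and the threshold `min_bp` are hypothetical, and the data frame is a shortened copy of the example data.

```r
# Shortened copy of the example data (first 7 rows only)
dat <- data.frame(
  sample001_rfu = c(5678, 8901, 10234, 12356, 15678, 13890, 10987),
  sample002_rfu = c(4567, 7890, 10123, 12345, 14567, 16589, 13425),
  sample003_rfu = c(3456, 6789, 10010, 11234, 13445, 15624, 17245),
  size_bp       = c(1000, 5000, 10000, 15000, 20000, 25000, 30000)
)

# Select the sample columns by name instead of by position (-4)
rfu_cols <- grep("_rfu$", names(dat), value = TRUE)

# Named numeric vector: size (bp) at the RFU peak of each sample
peak_bp <- vapply(dat[rfu_cols],
                  function(y) dat$size_bp[which.max(y)],
                  numeric(1))

# The same information as a named list, one element per sample ID
peak_list <- as.list(peak_bp)

# Hypothetical processing threshold (assumption, not from the question)
min_bp <- 25000
passes <- peak_bp >= min_bp

# Sample IDs whose peak size meets the threshold
names(peak_bp)[passes]
```

Selecting columns by a name pattern avoids relying on `size_bp` being the fourth column, which may matter with hundreds of samples.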

dplyr

library(dplyr)
dat %>%
  summarize(across(-size_bp, ~ size_bp[which.max(.)]))
#   sample001_rfu sample002_rfu sample003_rfu
# 1         20000         25000         30000

data.table

library(data.table)
DT <- as.data.table(dat) # makes a copy; setDT(dat) converts by reference and is the preferred/canonical method
DT[, lapply(.SD, function(z) size_bp[which.max(z)]), .SDcols = patterns("^sample")]
#    sample001_rfu sample002_rfu sample003_rfu
#            <int>         <int>         <int>
# 1:         20000         25000         30000

Data

dat <- structure(list(sample001_rfu = c(5678L, 8901L, 10234L, 12356L, 15678L, 13890L, 10987L, 8902L, 6513L, 4178L, 3213L), sample002_rfu = c(4567L, 7890L, 10123L, 12345L, 14567L, 16589L, 13425L, 11323L, 8919L, 6528L, 4380L), sample003_rfu = c(3456L, 6789L, 10010L, 11234L, 13445L, 15624L, 17245L, 15428L, 12879L, 10256L, 8621L), size_bp = c(1000L, 5000L, 10000L, 15000L, 20000L, 25000L, 30000L, 35000L, 40000L, 45000L, 50000L)), class = "data.frame", row.names = c(NA, -11L))

CodePudding user response：

Here's another base R method:

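# Columns holding RFU readings (names starting with "sample")
samp_cols = names(df)[startsWith(names(df), "sample")]

# For each sample: find the row of the peak, then take the RFU and size at that row
result = lapply(samp_cols, function(x){
       mx = which.max(df[[x]])
       list(sample = x, max_rfu = df[mx, x], bp = df[mx, "size_bp"])
})

# Stack the per-sample lists into one matrix (one row per sample)
do.call(rbind, result)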
#      sample          max_rfu bp   
# [1,] "sample001_rfu" 15678   20000
# [2,] "sample002_rfu" 16589   25000
# [3,] "sample003_rfu" 17245   30000
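
Note that `do.call(rbind, ...)` on a list of lists yields a matrix whose cells are list elements, which can be awkward to filter or sort. If an ordinary data frame is preferred, one possible variation (a sketch, not part of the answer above) is to build a one-row data frame per sample and bind those. The name `peaks` and the 25000 bp cut-off are hypothetical, and the data is a shortened version of the example.

```r
# Shortened version of the example data
df <- data.frame(
  sample001_rfu = c(12356, 15678, 13890, 10987),
  sample002_rfu = c(12345, 14567, 16589, 13425),
  sample003_rfu = c(11234, 13445, 15624, 17245),
  size_bp       = c(15000, 20000, 25000, 30000)
)

samp_cols <- names(df)[startsWith(names(df), "sample")]

# One single-row data frame per sample, then row-bind them
peaks <- do.call(rbind, lapply(samp_cols, function(x) {
  mx <- which.max(df[[x]])            # row index of the RFU peak
  data.frame(sample  = x,
             max_rfu = df[[x]][mx],
             bp      = df$size_bp[mx])
}))

# Columns are now plain vectors, so ordinary subsetting works
peaks[peaks$bp >= 25000, ]
```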

Using this data:

df = read.table(text = 'sample001_rfu   sample002_rfu   sample003_rfu   size_bp
5678    4567    3456    1000
8901    7890    6789    5000
10234   10123   10010   10000
12356   12345   11234   15000
15678   14567   13445   20000
13890   16589   15624   25000
10987   13425   17245   30000
8902    11323   15428   35000
6513    8919    12879   40000
4178    6528    10256   45000
3213    4380    8621    50000', header = TRUE)

CodePudding user response：

Here is another tidyverse approach:

library(dplyr)
library(tidyr)

df %>% 
  pivot_longer(-size_bp) %>% 
  group_by(name) %>% 
  slice_max(value, n = 1)

#   size_bp name          value
#     <int> <chr>         <int>
# 1   20000 sample001_rfu 15678
# 2   25000 sample002_rfu 16589
# 3   30000 sample003_rfu 17245
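
One difference between the answers is worth keeping in mind: `which.max()` returns only the first position of the maximum, whereas `slice_max()` keeps all tied rows by default (`with_ties = TRUE`). If a sample could show the same peak RFU at two sizes, the base R sketch below illustrates the two behaviours. The vectors here are made-up values for illustration, and `first_peak` and `all_peaks` are hypothetical names.

```r
# Made-up readings with a tie between 5000 bp and 10000 bp
size_bp <- c(1000, 5000, 10000, 15000)
rfu     <- c(200, 900, 900, 300)

# which.max() reports the first maximum only
first_peak <- size_bp[which.max(rfu)]

# Comparing against max() reports every position that shares the maximum
all_peaks <- size_bp[which(rfu == max(rfu))]

first_peak
all_peaks
```

Which behaviour is appropriate depends on how the downstream processing should treat a sample with two equally high peaks.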