Perform normalisation on pairs of columns

Time:11-10

I have a large dataframe:

df <- data.frame(S.1_Dxx = runif(100, min = 1, max = 3.5)
                 ,S.1_Px1 = runif(100, min = 0.5, max = 3)
                 ,S.2_Dxhfah = runif(100, min = 0.7, max = 2)
                 ,S.2_Pxhgm = runif(100, min = 0.4, max = 1.4)
                 ,S._Dxhgm = runif(100, min = 1, max = 2.5)
                 ,S._Pxhgm = runif(100, min = 0.4, max = 1.4)
)

Each column name starts with S., followed by 0-6 digits and then _. This prefix (from S up to the _) uniquely identifies a pair of columns that I would like to normalise together.

I can do it manually:

library(dplyr)  # %>% and select()
library(limma)  # normalizeBetweenArrays()

normS1 <- df %>% 
  select(starts_with("S.1_")) %>%
  as.matrix() %>% 
  normalizeBetweenArrays(method = "scale") %>% 
  as.data.frame()

However, I would like to perform this normalisation for all column pairs.

I imagine the beginning would look like this:

library(tidyr)    # pivot_longer()
library(stringr)  # word()

multiNormDf <- df %>%
  pivot_longer(everything(), names_to = "sample", values_to = "intensity") %>%
  mutate(sampleGroup = word(sample, start = 1, sep = "_")) %>%
  group_by(sampleGroup)

And this is where I get stuck. How do I ensure that, within each sampleGroup, the two samples are treated as separate samples? Maybe I should instead split the data frame into smaller data frames, perform the operation on each, and then bind them back together?

Also, I appreciate that normalising pairs of columns, rather than the whole dataset, is rarely a good idea. In this case, however, this seems the best course of action.

EDIT: I changed the type of normalisation (from vsn to normalizeBetweenArrays, which more people are probably familiar with).

User response:

Below is one approach using {dplyover} which should work in theory. Disclaimer: I'm the maintainer of {dplyover} which is not on CRAN.

We can use over() to loop over an object, here a character vector created with cut_names(), which extracts the first part of your variable names. Each string is available as .x, so across(starts_with(paste0(.x, "_"))) returns the matching pair of columns as a data frame. That data frame can be piped into as.matrix() and then into the normalisation function (justvsn() in the original question, normalizeBetweenArrays() in the code below). Depending on what the normalisation function returns, the result may need to be wrapped in list(), and the as.data.frame() step may not be needed.

The column names will be the strings from .x, but we can adjust the names easily with over()'s .names argument.
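To see which strings the loop would run over, here is a base R approximation of the prefix extraction. It is only a sketch of the idea and does not call cut_names() itself; the variable names cols and prefixes are made up for illustration.

```r
# Column names from the question's example data
cols <- c("S.1_Dxx", "S.1_Px1", "S.2_Dxhfah", "S.2_Pxhgm", "S._Dxhgm", "S._Pxhgm")

# Drop everything from the first underscore onwards and keep the unique prefixes
prefixes <- unique(sub("_.*", "", cols))
prefixes
#> [1] "S.1" "S.2" "S."

# Columns belonging to the first prefix; the trailing "_" keeps "S." from
# also matching "S.1_..." and "S.2_..."
cols[startsWith(cols, paste0(prefixes[1], "_"))]
#> [1] "S.1_Dxx" "S.1_Px1"
```

Pasting "_" back onto the prefix before matching matters here, because the bare prefix "S." is itself the start of every other column name.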

df <- data.frame(S.1_Dxx = runif(100, min = 1, max = 3.5)
                 ,S.1_Px1 = runif(100, min = 0.5, max = 3)
                 ,S.2_Dxhfah = runif(100, min = 0.7, max = 2)
                 ,S.2_Pxhgm = runif(100, min = 0.4, max = 1.4)
                 ,S._Dxhgm = runif(100, min = 1, max = 2.5)
                 ,S._Pxhgm = runif(100, min = 0.4, max = 1.4)
)


library(dplyr)
library(dplyover)
library(limma)

normS1 <- df %>% 
  select(starts_with("S.1_")) %>%
  as.matrix() %>% 
  normalizeBetweenArrays(method = "scale") %>% 
  as.data.frame()

df %>% 
  summarise(over(cut_names("_.*", .vars = names(df)),
                 ~ across(starts_with(paste0(.x, "_"))) %>% 
                   as.matrix() %>% 
                   normalizeBetweenArrays(method = "scale") %>% 
                   as.data.frame() 
                 )) %>% 
  do.call("data.frame", .) %>% 
  as_tibble()  # for printing

#> # A tibble: 100 x 6
#>    S.1.S.1_Dxx S.1.S.1_Px1 S.2.S.2_Dxhfah S.2.S.2_Pxhgm S..S._Dxhgm S..S._Pxhgm
#>          <dbl>       <dbl>          <dbl>         <dbl>       <dbl>       <dbl>
#>  1       1.11         1.10          1.14          0.502       1.72        1.88 
#>  2       2.75         2.00          0.853         0.758       0.844       1.15 
#>  3       2.40         1.30          0.594         1.28        1.73        1.04 
#>  4       1.28         3.43          1.05          1.32        1.13        0.677
#>  5       1.90         2.92          1.50          1.49        1.33        0.597
#>  6       3.00         2.06          0.857         0.687       0.961       1.21 
#>  7       0.919        3.35          1.43          1.25        0.792       1.81 
#>  8       2.90         1.36          1.29          1.56        1.74        1.52 
#>  9       2.95         1.36          0.832         1.36        1.41        1.44 
#> 10       1.88         2.69          1.18          1.44        0.896       0.604
#> # ... with 90 more rows

Created on 2022-11-09 by the reprex package (v2.0.1)
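The "split, normalise, bind back" idea from the question can also be sketched without {dplyover}. The helper below, normalise_pairs(), is hypothetical (it is not part of any package) and assumes the normalisation function takes a numeric matrix and returns a matrix of the same dimensions, as normalizeBetweenArrays() does for matrix input. The limma call is only run if limma is installed.

```r
# Toy data with the same naming scheme as in the question
set.seed(1)
df <- data.frame(
  S.1_Dxx    = runif(100, min = 1,   max = 3.5),
  S.1_Px1    = runif(100, min = 0.5, max = 3),
  S.2_Dxhfah = runif(100, min = 0.7, max = 2),
  S.2_Pxhgm  = runif(100, min = 0.4, max = 1.4),
  S._Dxhgm   = runif(100, min = 1,   max = 2.5),
  S._Pxhgm   = runif(100, min = 0.4, max = 1.4)
)

# Hypothetical helper: apply `normalise` separately to each group of columns
# that shares a prefix (everything before the first underscore).
normalise_pairs <- function(data, normalise) {
  prefix <- sub("_.*$", "", names(data))      # "S.1", "S.1", "S.2", ...
  groups <- split(seq_along(data), prefix)    # column indices per prefix
  out <- data
  for (idx in groups) {
    m <- normalise(as.matrix(data[idx]))      # normalise one pair on its own
    out[idx] <- as.data.frame(m)              # write back, keeping names and order
  }
  out
}

# Use limma's scale normalisation if the package is available
if (requireNamespace("limma", quietly = TRUE)) {
  multiNormDf <- normalise_pairs(
    df,
    function(m) limma::normalizeBetweenArrays(m, method = "scale")
  )
  head(multiNormDf)
}
```

Because the results are written back into a copy of the input, the output keeps the original column names and column order, so no renaming step is needed afterwards.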
