How to compare columns two by two, based on their column names, in an R data frame?

Time:01-03

I have a dataset with many columns, and I would like to compare each column with its replicate. Every sample has two replicates. Replicate 1 is named like name_sample_1 and replicate 2 like name_sample_1_2.

For each sample, I would like to compare replicate 1 with replicate 2: if a value is present in one replicate and the other is 0, I would like to replace both values with NA.

Original

replicate_1 replicate_1_2
0 0
750 0
0 850
650 950

Wanted

replicate_1 replicate_1_2
0 0
NA NA
NA NA
650 950


CodePudding user response:

You can do this with a vectorized base R approach: index the rows where exactly one of the two replicates is 0 and assign NA to every column of those rows:

df[(df$replicate_1 == 0 | df$replicate_1_2 == 0) &
     !(df$replicate_1 == 0 & df$replicate_1_2 == 0), ] <- NA

Output:

#   replicate_1 replicate_1_2
# 1           0             0
# 2          NA            NA
# 3          NA            NA
# 4         650           950

# Data (define df before running the replacement above)
df <- read.table(text = "replicate_1    replicate_1_2
0   0
750 0
0   850
650 950", header = TRUE)

Note that this replaces the values in every column of the matching rows with NA. If you only want to replace values in certain columns, name them in the column index.

Example data building on what you provided, with an extra column whose values should be kept:

df2 <- read.table(text = "replicate_1   replicate_1_2 ignore_column
0   0 A
750 0 B
0   850 C
650 950 D", header = TRUE)

df2[(df2$replicate_1 == 0 | df2$replicate_1_2 == 0) &
     !(df2$replicate_1 == 0 & df2$replicate_1_2 == 0), 
     c("replicate_1", "replicate_1_2")] <- NA

Output:

#   replicate_1 replicate_1_2 ignore_column
# 1           0             0             A
# 2          NA            NA             B
# 3          NA            NA             C
# 4         650           950             D
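
The row condition in both snippets above amounts to an exclusive or: exactly one of the two replicates is 0. As a minimal sketch (not part of the original answer), it can be written with base R's xor(), assuming the replicate columns contain no missing values to begin with:

```r
# Same example data as above.
df <- read.table(text = "replicate_1 replicate_1_2
0 0
750 0
0 850
650 950", header = TRUE)

# TRUE on rows where exactly one of the two replicates is 0.
one_zero <- xor(df$replicate_1 == 0, df$replicate_1_2 == 0)

# Set only the two replicate columns to NA on those rows.
df[one_zero, c("replicate_1", "replicate_1_2")] <- NA
df
```

Storing the condition in a named vector such as one_zero (a name chosen here for illustration) also makes it easy to check which rows will be changed before overwriting anything.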

CodePudding user response:

Here's a solution that extends to n pairs of columns with the same prefix.

First, use reproducible data. There are two pairs of columns with the same prefix:

dat <- structure(list(some_name_replicate_1 = c(0L, 750L, 0L, 650L), 
    some_name_replicate_1_2 = c(0L, 0L, 850L, 950L), some_othername_replicate_1 = c(0L, 
    750L, 0L, 0L), some_othername_replicate_1_2 = c(0L, 0L, 0L, 
    950L)), class = "data.frame", row.names = c(NA, -4L))

#   some_name_replicate_1 some_name_replicate_1_2 some_othername_replicate_1 some_othername_replicate_1_2
# 1                     0                       0                          0                            0
# 2                   750                       0                        750                            0
# 3                     0                     850                          0                            0
# 4                   650                     950                          0                          950

The code works in three steps:

  1. Split the columns by prefix into a list of two-column data frames
  2. In each pair, replace the rows where exactly one value is 0 with NA
  3. Bind the list back together into a single data frame

dat |>
  split.default(gsub("_replicate_1.*", "", colnames(dat))) |>
  lapply(function(x) {
    # product == 0 and sum != 0 means exactly one of the two values is 0
    x[x[[1]] * x[[2]] == 0 & x[[1]] + x[[2]] != 0, ] <- NA
    x
  }) |>
  Reduce(f = cbind)

#   some_name_replicate_1 some_name_replicate_1_2 some_othername_replicate_1 some_othername_replicate_1_2
# 1                     0                       0                          0                            0
# 2                    NA                      NA                         NA                           NA
# 3                    NA                      NA                          0                            0
# 4                   650                     950                         NA                           NA
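
Because split.default() groups columns by prefix, the result comes back ordered by prefix, and every column is expected to belong to a pair. The following is a rough sketch of an in-place alternative that keeps the original column order and leaves unpaired columns untouched. The helper name mask_unpaired_zeros and the label column are made up for illustration, and the sketch assumes the naming from the question, where replicate 2 is the replicate 1 name followed by "_2":

```r
# Hypothetical helper (not from the answers above): for every column <name>
# that has a partner column <name>_2, set both to NA on the rows where
# exactly one of the two values is 0. All other columns are left alone.
mask_unpaired_zeros <- function(data, suffix = "_2") {
  for (rep1 in names(data)) {
    rep2 <- paste0(rep1, suffix)
    if (!rep2 %in% names(data)) next  # no partner column, skip

    # which() drops NA comparisons, so rows with existing NAs are not selected.
    rows <- which(xor(data[[rep1]] == 0, data[[rep2]] == 0))
    if (length(rows) > 0) {
      data[rows, c(rep1, rep2)] <- NA
    }
  }
  data
}

# Same values as dat above, plus an extra column that must be kept.
dat <- data.frame(
  some_name_replicate_1        = c(0, 750, 0, 650),
  some_name_replicate_1_2      = c(0, 0, 850, 950),
  some_othername_replicate_1   = c(0, 750, 0, 0),
  some_othername_replicate_1_2 = c(0, 0, 0, 950),
  label                        = c("A", "B", "C", "D")
)

result <- mask_unpaired_zeros(dat)
result
```

The loop looks each partner up by name rather than by position, so the pairs do not need to sit next to each other in the data frame.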