Compare two column pairs in R

Time:10-22

I have a data frame like the one below.

df <- structure(list(`chr1:110363793:G:C_A1` = c("1", "2", "2", "2"
), `chr1:110363793:G:C_A2` = c("2", "1", "2", "2"), `chr1:110363823:A:G_A1` = c("2", 
"2", "2", "2"), `chr1:110363823:A:G_A2` = c("2", "2", "2", "2"
), `chr1:110363849:A:G_A1` = c("2", "2", "2", "2"), `chr1:110363849:A:G_A2` = c("2", 
"2", "1", "2")), row.names = c("4_653_99", "4_677_99", "25_9_172", 
"27_135_62706"), class = "data.frame")

I want to compare the columns in pairs, starting with the first two columns. For each row, if the two values in a pair are the same, I want a 0 in a new Status column for that pair; if they differ, I want a 1. The result should look like this:

             chr1:110363793:G:C_A1 chr1:110363793:G:C_A2 chr1:110363823:A:G_A1 chr1:110363823:A:G_A2 chr1:110363849:A:G_A1 chr1:110363849:A:G_A2 chr1:110363793:G:C_Status chr1:110363823:A:G_Status
4_653_99                         2                     1                     2                     2                     2                     2                         1                         0
4_677_99                         2                     2                     2                     2                     2                     2                         0                         0
25_9_172                         2                     2                     2                     2                     2                     2                         0                         0
27_135_62706                     2                     2                     2                     1                     2                     2                         0                         1
             chr1:110363849:A:G_Status
4_653_99                             0
4_677_99                             0
25_9_172                             0
27_135_62706                         0

I can do this using for loops, but is there a better way to solve this problem?

My for loop for this problem:

vars <- unique(sapply(strsplit(colnames(df), "_"), `[`, 1))
for (i in seq_along(vars)) {
  df[, paste(vars[i], "Status", sep = "_")] <- as.numeric(df[, paste0(vars[i], "_A1")] != df[, paste0(vars[i], "_A2")])
}

CodePudding user response:

Up front: this first version produces a less-than-ideal frame, because the Status column name is repeated.

dat <- df  # the question's data frame
odds <- seq(1, ncol(dat), by = 2)
odds
# [1] 1 3 5
do.call(cbind, unname(lapply(split.default(dat, rep(odds, each = 2)), 
                             function(z) cbind(z, Status = +(z[[1]] != z[[2]])))))
#              chr1:110363793:G:C_A1 chr1:110363793:G:C_A2 Status chr1:110363823:A:G_A1 chr1:110363823:A:G_A2 Status chr1:110363849:A:G_A1 chr1:110363849:A:G_A2 Status
# 4_653_99                         1                     2      1                     2                     2      0                     2                     2      0
# 4_677_99                         2                     1      1                     2                     2      0                     2                     2      0
# 25_9_172                         2                     2      0                     2                     2      0                     2                     1      1
# 27_135_62706                     2                     2      0                     2                     2      0                     2                     2      0

If you need the names to be unique, for example by numbering the pairs, then something like this may do:

do.call(cbind, Map(function(z, nm) setNames(cbind(z, Status = +(z[[1]] != z[[2]])), c(names(z), nm)),
                   split.default(dat, rep(odds, each = 2)), 
                   paste0("Status_", seq_along(odds)), USE.NAMES = FALSE))
#              chr1:110363793:G:C_A1 chr1:110363793:G:C_A2 Status_1 chr1:110363823:A:G_A1 chr1:110363823:A:G_A2 Status_2 chr1:110363849:A:G_A1 chr1:110363849:A:G_A2 Status_3
# 4_653_99                         1                     2        1                     2                     2        0                     2                     2        0
# 4_677_99                         2                     1        1                     2                     2        0                     2                     2        0
# 25_9_172                         2                     2        0                     2                     2        0                     2                     1        1
# 27_135_62706                     2                     2        0                     2                     2        0                     2                     2        0

Walk-through:

  • split.default splits the columns along a counter; we can't call split directly, because S3 method dispatch would pick split.data.frame, which splits by row rather than by column, as we need here.

  • we lapply across the result, and each time the anonymous function is called, z is a two-column frame;

  • from there, the internal cbind adds a status column that provides the 0s and 1s as desired.
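
To make the first bullet concrete, here is a minimal sketch on a small made-up frame (toy and its column names are hypothetical, not from the question). It contrasts split(), which dispatches to split.data.frame and cuts by row, with split.default(), which treats the frame as a list of columns.

```r
# Hypothetical toy frame with two column pairs (names are illustrative only)
toy <- data.frame(a_A1 = c("1", "2"), a_A2 = c("2", "2"),
                  b_A1 = c("2", "2"), b_A2 = c("2", "1"),
                  stringsAsFactors = FALSE)

# One group label per column: 1 1 3 3
grp <- rep(seq(1, ncol(toy), by = 2), each = 2)

# split.default() sees the frame as a list of columns, so each piece
# is a two-column data frame
by_col <- split.default(toy, grp)

# split() dispatches to split.data.frame() and cuts by row instead,
# so each piece is a one-row frame that still has all four columns
by_row <- split(toy, c("r1", "r2"))

lapply(by_col, names)
lapply(by_row, dim)
```

The pieces of by_col are what the anonymous function receives as z.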

(FYI: this will not work as intended if dat has an odd number of columns ...)
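
One way to guard against that case is to check the column count before pairing. A minimal sketch, assuming the pairs are always adjacent columns; pair_status is a hypothetical helper name, not part of the answer's code:

```r
# Hypothetical helper: returns one 0/1 Status column per adjacent column pair
pair_status <- function(dat) {
  # Fail early rather than silently mis-pairing columns
  stopifnot(ncol(dat) %% 2 == 0)
  odds <- seq(1, ncol(dat), by = 2)
  # Unary + turns the logical comparison into 0/1 integers
  out <- lapply(odds, function(i) +(dat[[i]] != dat[[i + 1]]))
  names(out) <- paste0("Status_", seq_along(odds))
  as.data.frame(out)
}

# Made-up example frame with a single pair
toy <- data.frame(x_A1 = c("1", "2"), x_A2 = c("2", "2"))
res <- pair_status(toy)
res
```

Calling pair_status() on a frame with an odd number of columns stops with an error instead of returning a misaligned result.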

CodePudding user response:

Update: improved code.

library(dplyr)
library(tidyr)
df %>% 
  mutate(rn = row_number()) %>%
  pivot_longer(cols = -rn, names_to = c('grp', '.value'),
               names_sep = "_") %>% 
  group_by(grp) %>%
  transmute(rn, STATUS = ifelse(A1 != A2, 1, 0)) %>% 
  pivot_wider(names_from = grp, values_from = STATUS,        
              names_glue = "{.value}.{grp}") %>%
  select(-rn) %>%
  bind_cols(df, .) %>% 
  relocate(1,2,7,3,4,8,5,6,9)

output:

             chr1:110363793:G:C_A1 chr1:110363793:G:C_A2 STATUS.chr1:110363793:G:C chr1:110363823:A:G_A1 chr1:110363823:A:G_A2 STATUS.chr1:110363823:A:G chr1:110363849:A:G_A1 chr1:110363849:A:G_A2 STATUS.chr1:110363849:A:G
4_653_99                         1                     2                         1                     2                     2                         0                     2                     2                         0
4_677_99                         2                     1                         1                     2                     2                         0                     2                     2                         0
25_9_172                         2                     2                         0                     2                     2                         0                     2                     1                         1
27_135_62706                     2                     2                         0                     2                     2                         0                     2                     2                         0

First answer: here is a dplyr solution. Its drawback is that the new columns have to be renamed by hand:

library(dplyr)
library(tidyr)
df %>% 
  mutate(rn = row_number()) %>%
  pivot_longer(cols = -rn, names_to = c('grp', '.value'),
               names_sep = "_") %>% 
  group_by(grp) %>%
  transmute(rn, new = ifelse(A1 != A2, 1, 0)) %>% 
  pivot_wider(names_from = grp, values_from = new) %>%
  select(-rn) %>%
  bind_cols(df, .)
             chr1:110363793:G:C_A1 chr1:110363793:G:C_A2 chr1:110363823:A:G_A1 chr1:110363823:A:G_A2 chr1:110363849:A:G_A1 chr1:110363849:A:G_A2 chr1:110363793:G:C chr1:110363823:A:G chr1:110363849:A:G
4_653_99                         1                     2                     2                     2                     2                     2                  1                  0                  0
4_677_99                         2                     1                     2                     2                     2                     2                  1                  0                  0
25_9_172                         2                     2                     2                     2                     2                     1                  0                  0                  1
27_135_62706                     2                     2                     2                     2                     2                     2                  0                  0                  0

CodePudding user response:

An easy base R solution with sapply:

new <- sapply(seq(1, ncol(df), by = 2),
              function(x) as.numeric(df[, x] != df[, x + 1]))
colnames(new) <- paste(unique(sub("_.*", "", colnames(df))), "Status", sep = "_")

gives:

data.frame( df, new )
             chr1.110363793.G.C_A1 chr1.110363793.G.C_A2 chr1.110363823.A.G_A1
4_653_99                         1                     2                     2
4_677_99                         2                     1                     2
25_9_172                         2                     2                     2
27_135_62706                     2                     2                     2
             chr1.110363823.A.G_A2 chr1.110363849.A.G_A1 chr1.110363849.A.G_A2
4_653_99                         2                     2                     2
4_677_99                         2                     2                     2
25_9_172                         2                     2                     1
27_135_62706                     2                     2                     2
             chr1.110363793.G.C_Status chr1.110363823.A.G_Status
4_653_99                             1                         0
4_677_99                             1                         0
25_9_172                             0                         0
27_135_62706                         0                         0
             chr1.110363849.A.G_Status
4_653_99                             0
4_677_99                             0
25_9_172                             1
27_135_62706                         0
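
Note that data.frame() rewrote the colons in the column names as dots, because its check.names argument defaults to TRUE. If the original names matter, one option is to pass check.names = FALSE. A minimal sketch, using a small hypothetical frame (toy, with made-up column names) in place of df:

```r
# Hypothetical stand-in for df with two non-syntactic column pairs
toy <- data.frame(`chr1:1:G:C_A1` = c("1", "2"),
                  `chr1:1:G:C_A2` = c("2", "2"),
                  `chr1:2:A:G_A1` = c("2", "2"),
                  `chr1:2:A:G_A2` = c("2", "1"),
                  check.names = FALSE)

# Same computation as in the sapply answer: compare each odd column
# with its right-hand neighbour
new <- sapply(seq(1, ncol(toy), by = 2),
              function(x) as.numeric(toy[, x] != toy[, x + 1]))
colnames(new) <- paste(unique(sub("_.*", "", colnames(toy))), "Status", sep = "_")

mangled <- data.frame(toy, new)                       # default: ":" becomes "."
kept    <- data.frame(toy, new, check.names = FALSE)  # names left untouched

names(mangled)
names(kept)
```

With check.names = FALSE the columns must be accessed with backticks or [[ ]], since names containing colons are not syntactic R names.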