I have a data frame like the one below.
df <- structure(list(`chr1:110363793:G:C_A1` = c("1", "2", "2", "2"
), `chr1:110363793:G:C_A2` = c("2", "1", "2", "2"), `chr1:110363823:A:G_A1` = c("2",
"2", "2", "2"), `chr1:110363823:A:G_A2` = c("2", "2", "2", "2"
), `chr1:110363849:A:G_A1` = c("2", "2", "2", "2"), `chr1:110363849:A:G_A2` = c("2",
"2", "1", "2")), row.names = c("4_653_99", "4_677_99", "25_9_172",
"27_135_62706"), class = "data.frame")
I want to compare the columns in pairs, starting with the first two columns. For each pair, if the two values in a row are the same, I want 0 in a new Status column,
and if the values are different, I want 1. The result should look like this:
chr1:110363793:G:C_A1 chr1:110363793:G:C_A2 chr1:110363823:A:G_A1 chr1:110363823:A:G_A2 chr1:110363849:A:G_A1 chr1:110363849:A:G_A2 chr1:110363793:G:C_Status chr1:110363823:A:G_Status
4_653_99 2 1 2 2 2 2 1 0
4_677_99 2 2 2 2 2 2 0 0
25_9_172 2 2 2 2 2 2 0 0
27_135_62706 2 2 2 1 2 2 0 1
chr1:110363849:A:G_Status
4_653_99 0
4_677_99 0
25_9_172 0
27_135_62706 0
I can do this using for loops, but is there a better way to solve this problem?
My for loop to solve this problem:
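# Variant IDs: the part of each column name before the "_A1"/"_A2" suffix
vars <- unique(sapply(strsplit(colnames(df), "_"), `[`, 1))
for (v in vars) {
  # 1 if the two columns of the pair differ in that row, 0 if they match
  df[, paste(v, "Status", sep = "_")] <- as.numeric(df[, paste0(v, "_A1")] != df[, paste0(v, "_A2")])
}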
CodePudding user response:
Up front: this produces a less-than-ideal frame, because the Status
column name is repeated.
dat <- df  # the question's data frame, called dat throughout this answer
odds <- seq(1, ncol(dat), by = 2)
odds
# [1] 1 3 5
do.call(cbind, unname(lapply(split.default(dat, rep(odds, each = 2)),
                             function(z) cbind(z, Status = +(z[[1]] != z[[2]])))))
# chr1:110363793:G:C_A1 chr1:110363793:G:C_A2 Status chr1:110363823:A:G_A1 chr1:110363823:A:G_A2 Status chr1:110363849:A:G_A1 chr1:110363849:A:G_A2 Status
# 4_653_99 1 2 1 2 2 0 2 2 0
# 4_677_99 2 1 1 2 2 0 2 2 0
# 25_9_172 2 2 0 2 2 0 2 1 1
# 27_135_62706 2 2 0 2 2 0 2 2 0
If you need the names to be unique, for example by counting along the pairs, then perhaps something like this:
do.call(cbind, Map(function(z, nm) setNames(cbind(z, Status = +(z[[1]] != z[[2]])), c(names(z), nm)),
                   split.default(dat, rep(odds, each = 2)),
                   paste0("Status_", seq_along(odds)), USE.NAMES = FALSE))
# chr1:110363793:G:C_A1 chr1:110363793:G:C_A2 Status_1 chr1:110363823:A:G_A1 chr1:110363823:A:G_A2 Status_2 chr1:110363849:A:G_A1 chr1:110363849:A:G_A2 Status_3
# 4_653_99 1 2 1 2 2 0 2 2 0
# 4_677_99 2 1 1 2 2 0 2 2 0
# 25_9_172 2 2 0 2 2 0 2 1 1
# 27_135_62706 2 2 0 2 2 0 2 2 0
Walk-through:
split.default is really just splitting along a counter; we can't use split by itself, since R's S3 method dispatch calls split.data.frame, which splits by row instead of by column as we want.
We lapply across this, and each time the anonymous function is called, z is a two-column frame.
From there, the inner cbind adds a Status column that provides the 0s and 1s as desired.
(FYI: this will not work as intended if dat has an odd number of columns ...)
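The walk-through above can be sketched on a small made-up frame. The object toy and its column names are hypothetical and only stand in for the question's data; the point is that split.default groups columns (by the label vector), giving one two-column frame per pair.

```r
# Hypothetical 2-row, 4-column frame used only for illustration.
toy <- data.frame(a_A1 = c("1", "2"), a_A2 = c("2", "2"),
                  b_A1 = c("2", "2"), b_A2 = c("2", "1"))

# One group label per column: 1 1 3 3, so adjacent columns share a label.
grp <- rep(seq(1, ncol(toy), by = 2), each = 2)

# Column-wise split: a list of two-column data frames.
pairs <- split.default(toy, grp)
names(pairs)            # "1" "3"
names(pairs[["1"]])     # "a_A1" "a_A2"

# Compare the two columns of each pair; unary + turns TRUE/FALSE into 1/0.
status <- lapply(pairs, function(z) +(z[[1]] != z[[2]]))
status[["1"]]           # 1 0
status[["3"]]           # 0 1
```

Each element of status is what the inner cbind attaches as the Status column for that pair.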
CodePudding user response:
Update: improved code!
library(dplyr)
library(tidyr)
df %>%
  mutate(rn = row_number()) %>%
  pivot_longer(cols = -rn, names_to = c('grp', '.value'),
               names_sep = "_") %>%
  group_by(grp) %>%
  # 1 when the two columns of a pair differ, 0 when they match
  transmute(rn, STATUS = ifelse(A1 != A2, 1, 0)) %>%
  pivot_wider(names_from = grp, values_from = STATUS,
              names_glue = "{.value}.{grp}") %>%
  select(-rn) %>%
  bind_cols(df, .) %>%
  relocate(1,2,7,3,4,8,5,6,9)
output:
chr1:110363793:G:C_A1 chr1:110363793:G:C_A2 STATUS.chr1:110363793:G:C chr1:110363823:A:G_A1 chr1:110363823:A:G_A2 STATUS.chr1:110363823:A:G chr1:110363849:A:G_A1 chr1:110363849:A:G_A2 STATUS.chr1:110363849:A:G
4_653_99 1 2 1 2 2 0 2 2 0
4_677_99 2 1 1 2 2 0 2 2 0
25_9_172 2 2 0 2 2 0 2 1 1
27_135_62706 2 2 0 2 2 0 2 2 0
First answer:
Here is a dplyr solution. Drawback: the new columns have to be renamed by hand:
library(dplyr)
library(tidyr)
df %>%
  mutate(rn = row_number()) %>%
  pivot_longer(cols = -rn, names_to = c('grp', '.value'),
               names_sep = "_") %>%
  group_by(grp) %>%
  # 1 when the two columns of a pair differ, 0 when they match
  transmute(rn, new = ifelse(A1 != A2, 1, 0)) %>%
  pivot_wider(names_from = grp, values_from = new) %>%
  select(-rn) %>%
  bind_cols(df, .)
chr1:110363793:G:C_A1 chr1:110363793:G:C_A2 chr1:110363823:A:G_A1 chr1:110363823:A:G_A2 chr1:110363849:A:G_A1 chr1:110363849:A:G_A2 chr1:110363793:G:C chr1:110363823:A:G chr1:110363849:A:G
4_653_99 1 2 2 2 2 2 1 0 0
4_677_99 2 1 2 2 2 2 1 0 0
25_9_172 2 2 2 2 2 1 0 0 1
27_135_62706 2 2 2 2 2 2 0 0 0
CodePudding user response:
An easy base R solution with sapply:
# Loop over the odd-numbered columns (the first column of each pair)
new <- sapply(seq(1, ncol(df), by = 2),
              function(x) as.numeric(df[, x] != df[, x + 1]))
colnames(new) <- paste( unique(sub("_.*","",colnames(df))), "Status", sep="_" )
gives:
data.frame( df, new )
chr1.110363793.G.C_A1 chr1.110363793.G.C_A2 chr1.110363823.A.G_A1
4_653_99 1 2 2
4_677_99 2 1 2
25_9_172 2 2 2
27_135_62706 2 2 2
chr1.110363823.A.G_A2 chr1.110363849.A.G_A1 chr1.110363849.A.G_A2
4_653_99 2 2 2
4_677_99 2 2 2
25_9_172 2 2 1
27_135_62706 2 2 2
chr1.110363793.G.C_Status chr1.110363823.A.G_Status
4_653_99 1 0
4_677_99 1 0
25_9_172 0 0
27_135_62706 0 0
chr1.110363849.A.G_Status
4_653_99 0
4_677_99 0
25_9_172 1
27_135_62706 0
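The dots in the column names above appear because data.frame() passes names through make.names() by default. A minimal sketch, assuming the original names with colons should be kept: pass check.names = FALSE. The data here is a cut-down, two-pair copy of the question's frame, and the odd/even logical indexing is a substitute for the sapply call, not the answerer's code.

```r
# Cut-down copy of the question's data (first two column pairs only).
df <- data.frame(
  `chr1:110363793:G:C_A1` = c("1", "2", "2", "2"),
  `chr1:110363793:G:C_A2` = c("2", "1", "2", "2"),
  `chr1:110363823:A:G_A1` = c("2", "2", "2", "2"),
  `chr1:110363823:A:G_A2` = c("2", "2", "2", "2"),
  row.names = c("4_653_99", "4_677_99", "25_9_172", "27_135_62706"),
  check.names = FALSE
)

# Odd columns vs even columns, compared in one vectorised step.
a1 <- as.matrix(df[c(TRUE, FALSE)])
a2 <- as.matrix(df[c(FALSE, TRUE)])
new <- +(a1 != a2)   # unary + turns the logical matrix into 1/0
colnames(new) <- paste(sub("_A1$", "", colnames(a1)), "Status", sep = "_")

# check.names = FALSE keeps the colons instead of turning them into dots.
out <- data.frame(df, new, check.names = FALSE)
names(out)[5]   # "chr1:110363793:G:C_Status"
out[[5]]        # 1 1 0 0
```

Like the sapply version, this assumes the columns come in adjacent A1/A2 pairs, so the frame must have an even number of columns.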