I have a few large dataframes in RStudio, that have this structure:
structure(list(CHROM = c("scaffold1000|size223437", "scaffold1000|size223437",
"scaffold1000|size223437", "scaffold1000|size223437"), POS = c(666,
1332, 3445, 4336), REF = c("A", "TA", "CTTGA", "GCTA"), RO = c(20,
14, 9, 25), ALT_1 = c("GAT", "TGC", "AGC", "T"), ALT_2 = c("CAG",
"TGA", "CGC", NA), ALT_3 = c("G", NA, "TGA", NA), ALT_4 = c("AGT",
NA, NA, NA), AO_1 = c(13, 4, 67, 120), AO_2 = c(12, 5, 34, NA
), AO_3 = c(6, NA, 18, NA), AO_4 = c(101, NA, NA, NA), AOF_1 = c(8.55263157894737,
17.3913043478261, 52.34375, 82.7586206896552), AOF_2 = c(7.89473684210526,
21.7391304347826, 26.5625, NA), AOF_3 = c(3.94736842105263, NA,
14.0625, NA), AOF_4 = c(66.4473684210526, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-4L))
But for an analysis I need it to look like this:
structure(list(CHROM = c("scaffold1000|size223437", "scaffold1000|size223437",
"scaffold1000|size223437", "scaffold1000|size223437"), POS = c(666,
1332, 3445, 4336), REF = c("A", "TA", "CTTGA", "GCTA"), RO = c(20,
14, 9, 25), ALT_1 = c("AGT", "TGA", "AGC", "T"), ALT_2 = c("CAG",
"TGC", "CGC", NA), ALT_3 = c("G", NA, "TGA", NA), ALT_4 = c("GAT",
NA, NA, NA), AO_1 = c(101, 5, 67, 120), AO_2 = c(12, 4, 34, NA
), AO_3 = c(6, NA, 18, NA), AO_4 = c(13, NA, NA, NA), AOF_1 = c(66.4473684210526,
21.7391304347826, 52.34375, 82.7586206896552), AOF_2 = c(7.89473684210526,
17.3913043478261, 26.5625, NA), AOF_3 = c(3.94736842105263, NA,
14.0625, NA), AOF_4 = c(8.55263157894737, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-4L))
So what I would like to do is to rearrange the content of a row in a way, that the columns ALT_1, ALT_2, ALT_3, ALT_4 are alphabetically sorted, but at the same time I also need to rearrange the corresponding columns of AO and AOF, so that the values still match. (The value of AO_1 should still match with the sequence that was in ALT_1. So if ALT_1 becomes ALT_2 in the sorted dataframe, AO_1 should also become AO_2)
What I tried so far, but didn't work:
Pasting the values of ALT_1, AO_1, AOF_1 all in one field, so I have them together with
if (is.na(X[i,6]) == FALSE) {
X[i,6] <- paste(X[i,6],X[i,10],X[i,14],sep=" ")
}
}
And then I wanted to extract every row as a vector to sort the values and put it back in the dataframe, but I didn't manage to do this.
So the question would be how I can order the dataframe to get the desired output? (I need to apply this to 32 dataframes with each having >100.000 values)
CodePudding user response:
here is a data.table
approach
library(data.table)
# Set to data.table format
setDT(mydata)
# Melt to long format
DT.melt <- melt(mydata, measure.vars = patterns(ALT = "^ALT_", AO = "^AO_", AOF = "^AOF_"))
# order by groups, na's at the end
setorderv(DT.melt, cols = c("CHROM", "POS", "ALT"), na.last = TRUE)
# cast to wide again, use rowid() for numbering
dcast(DT.melt, CHROM POS REF RO ~ rowid(REF), value.var = list("ALT", "AO", "AOF"))
# CHROM POS REF RO ALT_1 ALT_2 ALT_3 ALT_4 AO_1 AO_2 AO_3 AO_4 AOF_1 AOF_2 AOF_3 AOF_4
# 1: scaffold1000|size223437 666 A 20 AGT CAG G GAT 101 12 6 13 66.44737 7.894737 3.947368 8.552632
# 2: scaffold1000|size223437 1332 TA 14 TGA TGC <NA> <NA> 5 4 NA NA 21.73913 17.391304 NA NA
# 3: scaffold1000|size223437 3445 CTTGA 9 AGC CGC TGA <NA> 67 34 18 NA 52.34375 26.562500 14.062500 NA
# 4: scaffold1000|size223437 4336 GCTA 25 T <NA> <NA> <NA> 120 NA NA NA 82.75862 NA NA NA