I have several data frames, each holding a list of gene names without a header. Each file looks roughly like this:
SCA-6_Chr1v1_00001
SCA-6_Chr1v1_00002
SCA-6_Chr1v1_00003
SCA-6_Chr1v1_00004
SCA-6_Chr1v1_00005
SCA-6_Chr1v1_00006
SCA-6_Chr1v1_00009
SCA-6_Chr1v1_00010
SCA-6_Chr1v1_00014
SCA-6_Chr1v1_00015
SCA-6_Chr1v1_00017
Each of these data frames is written to a separate .txt file, and I have read them all into one list like so:
temp = list.files(pattern = "*.txt")
myfiles = lapply(temp, FUN=read.table, header=FALSE)
With the myfiles list, I want to compare all of the data frames against each other, find the values unique to each file, and return them in a list where each element holds only the gene names not found in any other file (I assume I can do this with an lapply function). I have tried running the following code, but it is not dropping the shared values:
unique.genes = lapply(1:length(myfiles), function(n) setdiff(myfiles[[n]], unlist(myfiles[-n])))
Any help would be greatly appreciated.
CodePudding user response:
Here is an approach. First, provide reproducible data:
set.seed(42)
myfiles <- replicate(2, sample(LETTERS, 25, replace=TRUE), simplify=FALSE)
myfiles
# [[1]]
# [1] "Q" "E" "A" "Y" "J" "D" "R" "Z" "Q" "O" "X" "G" "D" "Y" "E" "N" "T" "Z" "R" "O" "C" "I" "Y" "D" "E"
#
# [[2]]
# [1] "M" "E" "T" "B" "H" "C" "Z" "A" "J" "X" "K" "O" "V" "Z" "H" "D" "D" "V" "R" "M" "E" "D" "B" "X" "R"
Now find the unique values within each vector:
result <- lapply(myfiles, unique)
result
# [[1]]
# [1] "Q" "E" "A" "Y" "J" "D" "R" "Z" "O" "X" "G" "N" "T" "C" "I"
#
# [[2]]
# [1] "M" "E" "T" "B" "H" "C" "Z" "A" "J" "X" "K" "O" "V" "D" "R"
Or this will sort them for easier comparison:
result2 <- lapply(myfiles, function(x) sort(unique(x)))
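Note that unique() only removes duplicates within each vector. If the goal is the values that occur in just one of the vectors, one possible sketch is to count occurrences across the de-duplicated vectors and keep those seen once. The toy vectors and the names vecs, dedup, counts, only_once and exclusive are made up for illustration and are not the data above:

```r
# Hypothetical toy vectors standing in for the gene lists
vecs <- list(
  c("A", "B", "B", "C"),
  c("B", "D"),
  c("C", "E", "E")
)

# Remove duplicates within each vector first
dedup <- lapply(vecs, unique)

# Count in how many vectors each value occurs
counts <- table(unlist(dedup))

# Values seen in exactly one vector
only_once <- names(counts)[counts == 1]

# Keep, per vector, only the values that no other vector contains
exclusive <- lapply(dedup, function(x) x[x %in% only_once])
exclusive
```

This gives the same kind of result as a setdiff against all the other vectors, but computes the counts once instead of rebuilding the comparison set for every element of the list.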
CodePudding user response:
Here is a way.
- Start by reading in the data with scan. This will create vectors, not data.frames; data.frames have a much slower access time.
- Then the lapply/setdiff will keep the values unique to each vector.
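The two steps above can be sketched on a small self-contained example before looking at the full reprex. The gene IDs, the file names a.txt, b.txt and c.txt, and the variable names are hypothetical, and the files are written to a temporary directory so nothing in the working directory is touched:

```r
# Hypothetical gene IDs written to throwaway files in a temporary directory
tmp_dir <- tempfile("genes_")
dir.create(tmp_dir)
writeLines(c("g1", "g2", "g3"), file.path(tmp_dir, "a.txt"))
writeLines(c("g2", "g3", "g4"), file.path(tmp_dir, "b.txt"))
writeLines(c("g3", "g5"),       file.path(tmp_dir, "c.txt"))

files <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)

# scan() returns a plain character vector per file;
# quiet = TRUE suppresses the "Read N items" message
genes <- lapply(files, scan, what = character(), quiet = TRUE)
names(genes) <- basename(files)

# For each vector, drop anything that appears in any of the other vectors
exclusive <- lapply(seq_along(genes), function(n) {
  setdiff(genes[[n]], unlist(genes[-n]))
})
names(exclusive) <- names(genes)

# Clean up the temporary files
unlink(tmp_dir, recursive = TRUE)

exclusive
```

Because each element of genes is already a character vector, setdiff compares gene names directly, with no need to pull a column out of a data.frame first.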
set.seed(2022)
myfiles <- replicate(10, unique(sample(c(LETTERS, 0:9, letters), 10, replace = TRUE)), simplify = FALSE)
# write each vector to its own file: test01.txt, test02.txt, ...
l <- lapply(seq_along(myfiles), \(i) {write.table(myfiles[[i]],
                                                  sprintf("test%02d.txt", i),
                                                  row.names = FALSE,
                                                  col.names = FALSE,
                                                  quote = FALSE)})
rm(l)
temp <- list.files(pattern = "*.txt")
myfiles <- lapply(temp, FUN = read.table, header = FALSE)
myfiles2 <- lapply(temp, FUN = scan, what = character())
unique.genes <- lapply(1:length(myfiles), function(n) setdiff(myfiles[[n]][[1]], unlist(myfiles[-n])))
unique.genes2 <- lapply(1:length(myfiles2), function(n) setdiff(myfiles2[[n]], unlist(myfiles2[-n])))
identical(unique.genes, unique.genes2)
#> [1] TRUE
library(microbenchmark)
mb <- microbenchmark(
  read.table = lapply(1:length(myfiles), function(n) setdiff(myfiles[[n]][[1]], unlist(myfiles[-n]))),
  scan = lapply(1:length(myfiles2), function(n) setdiff(myfiles2[[n]], unlist(myfiles2[-n])))
)
print(mb, order = "median", unit = "relative")
#> Unit: relative
#>        expr      min       lq     mean median       uq      max neval cld
#>        scan 1.000000 1.000000 1.000000  1.000 1.000000 1.000000   100  a
#>  read.table 3.048491 2.921598 2.511883  2.945 2.750842 1.002187   100   b
unlink(temp)
Created on 2022-07-28 by the reprex package (v2.0.1)