Determine Differences between Items in a List

Time:07-29

I have several data frames, each containing a list of gene names without a header. Each file looks roughly like this:

SCA-6_Chr1v1_00001
SCA-6_Chr1v1_00002
SCA-6_Chr1v1_00003
SCA-6_Chr1v1_00004
SCA-6_Chr1v1_00005
SCA-6_Chr1v1_00006
SCA-6_Chr1v1_00009
SCA-6_Chr1v1_00010
SCA-6_Chr1v1_00014
SCA-6_Chr1v1_00015
SCA-6_Chr1v1_00017

Each of these data frames is written to a separate .txt file, and I have read them all into one list like so:

temp = list.files(pattern = "*.txt")
myfiles = lapply(temp, FUN=read.table, header=FALSE)

With the myfiles list, I want to compare all of the data frames against each other and find the values unique to each file. The result should be a new list in which each element contains only the gene names not found in any other file (I assume this can be done with lapply). I have tried running the following code, but it is not dropping the shared values:

unique.genes = lapply(1:length(myfiles), function(n) setdiff(myfiles[[n]], unlist(myfiles[-n])))

Any help would be greatly appreciated.

CodePudding user response:

Here is an approach. First, provide reproducible data:

set.seed(42)
myfiles <- replicate(2, sample(LETTERS, 25, replace=TRUE), simplify=FALSE)
myfiles
# [[1]]
#  [1] "Q" "E" "A" "Y" "J" "D" "R" "Z" "Q" "O" "X" "G" "D" "Y" "E" "N" "T" "Z" "R" "O" "C" "I" "Y" "D" "E"
# 
# [[2]]
#  [1] "M" "E" "T" "B" "H" "C" "Z" "A" "J" "X" "K" "O" "V" "Z" "H" "D" "D" "V" "R" "M" "E" "D" "B" "X" "R"

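Now find the distinct values within each vector (this removes repeats inside each element; it does not drop values shared between elements):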

result <- lapply(myfiles, unique)
result
# [[1]]
#  [1] "Q" "E" "A" "Y" "J" "D" "R" "Z" "O" "X" "G" "N" "T" "C" "I"
# 
# [[2]]
#  [1] "M" "E" "T" "B" "H" "C" "Z" "A" "J" "X" "K" "O" "V" "D" "R"

Or this will sort them for easier comparison:

result2 <- lapply(myfiles, function(x) sort(unique(x)))
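
Note that unique() only removes repeats inside each vector; a letter present in both vectors, such as "E" above, stays in both results. A minimal sketch of the difference, using made-up toy vectors (the names toy, distinct_within and exclusive are illustrative only and not part of the answer above):

```r
# Toy data (hypothetical): "b" and "c" are shared between the two vectors.
toy <- list(c("a", "b", "b", "c"), c("b", "c", "d", "d"))

# unique() drops repeats inside each vector but keeps shared values.
distinct_within <- lapply(toy, unique)

# setdiff() against all other vectors keeps only the values exclusive to each one.
exclusive <- lapply(seq_along(toy), function(n) {
  setdiff(toy[[n]], unlist(toy[-n]))
})

print(distinct_within)  # "b" and "c" still appear in both elements
print(exclusive)        # only "a" and "d" remain
```

If the goal is the values found in one file only, the second form is probably the one to use.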

CodePudding user response:

Here is a way.

  • Start by reading in the data with scan. This creates character vectors rather than data.frames; data.frames have a much slower access time.
  • Then lapply/setdiff keeps, for each vector, only the values not found in any other vector.
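
The difference between the two readers can be seen on a small throwaway file. read.table returns a data frame, whereas setdiff works on vectors, so with read.table the single column has to be pulled out first (for example with [[1]]). A rough sketch, assuming a temporary file created with tempfile(); the gene-like names and the variable names are made up for illustration:

```r
# Write a small, made-up gene list to a temporary file.
path <- tempfile(fileext = ".txt")
writeLines(c("gene_01", "gene_02", "gene_03"), path)

df  <- read.table(path, header = FALSE)               # data frame with one column
vec <- scan(path, what = character(), quiet = TRUE)   # plain character vector

other <- c("gene_02", "gene_03", "gene_04")           # stands in for another file

# Extract the column before calling setdiff(); both routes give the same values.
from_df  <- setdiff(df[[1]], other)
from_vec <- setdiff(vec, other)

print(class(df))
print(class(vec))
print(from_df)

unlink(path)  # clean up the temporary file
```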
set.seed(2022)
myfiles <- replicate(10, unique(sample(c(LETTERS, 0:9, letters), 10, replace = TRUE)), simplify = FALSE)
l <- lapply(seq_along(myfiles), \(i) {write.table(myfiles[[i]], 
                                             sprintf("test%d.txt", i),
                                             row.names = FALSE,
                                             col.names = FALSE,
                                             quote = FALSE)})
rm(l)

temp <- list.files(pattern = "\\.txt$")
myfiles <- lapply(temp, FUN = read.table, header = FALSE)
myfiles2 <- lapply(temp, FUN = scan, what = character())

unique.genes <- lapply(1:length(myfiles), function(n) setdiff(myfiles[[n]][[1]], unlist(myfiles[-n])))
unique.genes2 <- lapply(1:length(myfiles2), function(n) setdiff(myfiles2[[n]], unlist(myfiles2[-n])))

identical(unique.genes, unique.genes2)
#> [1] TRUE

library(microbenchmark)
mb <- microbenchmark(
  read.table = lapply(1:length(myfiles), function(n) setdiff(myfiles[[n]][[1]], unlist(myfiles[-n]))),
  scan = lapply(1:length(myfiles2), function(n) setdiff(myfiles2[[n]], unlist(myfiles2[-n])))
)
print(mb, order = "median", unit = "relative")
#> Unit: relative
#>        expr      min       lq     mean median       uq      max neval cld
#>        scan 1.000000 1.000000 1.000000  1.000 1.000000 1.000000   100  a 
#>  read.table 3.048491 2.921598 2.511883  2.945 2.750842 1.002187   100   b

unlink(temp)

Created on 2022-07-28 by the reprex package (v2.0.1)
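
With many files, unlist(myfiles2[-n]) rebuilds the combined vector once per file. One possible alternative, not taken from the answer above, is to count in how many vectors each value occurs and keep the values with a count of one. A sketch on made-up data, assuming character vectors as returned by scan (exclusive_values is a hypothetical helper name):

```r
# Hypothetical helper: for each vector in `x`, keep the values that occur
# in no other vector of the list.
exclusive_values <- function(x) {
  x <- lapply(x, unique)                 # count each value once per vector
  counts <- table(unlist(x))             # number of vectors containing each value
  singles <- names(counts)[counts == 1]  # values present in exactly one vector
  lapply(x, function(v) v[v %in% singles])
}

toy <- list(c("a", "b", "c"), c("b", "c", "d"), c("c", "e", "e"))
res <- exclusive_values(toy)
print(res)

# Cross-check against the lapply/setdiff approach on the same toy data.
ref <- lapply(seq_along(toy), function(n) setdiff(toy[[n]], unlist(toy[-n])))
print(identical(res, ref))
```

Whether this is actually faster depends on the number and size of the files, so it is worth benchmarking on real data before switching.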
