Reading .tsv files in a loop gives a list whose elements have different names but identical contents

Time:01-31

Basically, I would like to read and manipulate about 80 .tsv files in the same folder. I don't understand why, when I use lapply or sapply, it reads all the .tsv files but the elements of the resulting list are identical (same number of rows, etc.) even though each element has a different name. It's as if it were reading the first file of the list 80 times.

files <- list.files(path="/my/path/tracks")

head(files)
[1] "ENCFF029UVS forebrain tissue embryo (16.5 days).tsv" "ENCFF042VCB forebrain tissue embryo (11.5 days).tsv"
[3] "ENCFF080PBH forebrain tissue embryo (15.5 days).tsv" "ENCFF081SJL limb tissue embryo (15.5 days).tsv"     
[5] "ENCFF110ZFH heart tissue embryo (16.5 days).tsv"     "ENCFF126VCW midbrain tissue embryo (10.5 days).tsv" 


try= sapply(files, simplify=FALSE, function(i){    
  message("reading file", i, "..." )
  df= read_tsv(file = i, )
 
  df= df[grep("ENSMUS*", df$gene_id),]
  
  df$ID=gsub("\\..*","", df$gene_id)   
})

I get this kind of result:

> head(try$`ENCFF029UVS forebrain tissue embryo (16.5 days).tsv`)
[1] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028" "ENSMUSG00000000031" "ENSMUSG00000000037" "ENSMUSG00000000049"
> head(try$`ENCFF042VCB forebrain tissue embryo (11.5 days).tsv`)
[1] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028" "ENSMUSG00000000031" "ENSMUSG00000000037" "ENSMUSG00000000049"
> head(try$`ENCFF126VCW midbrain tissue embryo (10.5 days).tsv`)
[1] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028" "ENSMUSG00000000031" "ENSMUSG00000000037" "ENSMUSG00000000049"

It's basically identical. What's wrong? thanks

CodePudding user response:

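The anonymous function you pass to sapply returns the value of its last expression, which in this case is the assignment to df$ID. The value of that assignment is the character vector of IDs, not the data frame, so each list element holds only an ID vector (and those vectors can look identical across files if the files list the same genes). Add a solitary df as the last line of the function, and you'll get the whole frame.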

try= sapply(files, simplify=FALSE, function(i){    
  message("reading file ", i, " ...")
  df= read_tsv(file = i)
  df= df[grep("ENSMUS*", df$gene_id),]
  df$ID=gsub("\\..*","", df$gene_id)   
  df                                     # ADD THIS: return the whole frame
})
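
To see the effect end to end without the original data, here is a minimal sketch under some assumptions: the two toy files, their TPM column, and the temporary folder are all hypothetical, base R's read.delim stands in for read_tsv so the example has no package dependency, and full.names = TRUE is used so the paths resolve regardless of the working directory.

```r
# Hypothetical toy data: two small TSV files with the same genes but different values.
dir <- tempfile("tracks_demo")
dir.create(dir)
genes <- c("ENSMUSG00000000001.4", "ENSMUSG00000000003.15", "spike_in_1")
write.table(data.frame(gene_id = genes, TPM = c(1.5, 0, 9)),
            file.path(dir, "sample_A.tsv"),
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(data.frame(gene_id = genes, TPM = c(7.2, 3.1, 4)),
            file.path(dir, "sample_B.tsv"),
            sep = "\t", quote = FALSE, row.names = FALSE)

files <- list.files(dir, pattern = "\\.tsv$", full.names = TRUE)

# Version from the question: the last expression is the assignment,
# so each element is only the character vector of IDs.
without_df <- lapply(files, function(i) {
  df <- read.delim(i)
  df <- df[grep("ENSMUS*", df$gene_id), ]
  df$ID <- gsub("\\..*", "", df$gene_id)
})

# Fixed version: the last expression is df, so each element is the whole frame.
with_df <- lapply(files, function(i) {
  df <- read.delim(i)
  df <- df[grep("ENSMUS*", df$gene_id), ]
  df$ID <- gsub("\\..*", "", df$gene_id)
  df
})
names(with_df) <- basename(files)

identical(without_df[[1]], without_df[[2]])  # TRUE: same IDs, so the elements look identical
is.data.frame(with_df[[1]])                  # TRUE
with_df[["sample_A.tsv"]]$TPM                # values from the first file
with_df[["sample_B.tsv"]]$TPM                # different values from the second file
```

In the first version the files are in fact read separately; the elements only look the same because each one keeps just the ID column, which is shared across the toy files.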

As a demonstration of how assigning a column into a frame doesn't return the whole frame, see this:

df <- data.frame(a=1:3)
ret <- (df$id <- letters[1:3])
ret
# [1] "a" "b" "c"

(This holds regardless of whether you use <- or =; the behavior in R is the same.)
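
The same point can be checked inside a function body, which is closer to what happens in the sapply call. The following is a small sketch with made-up function names; it compares a function ending in an assignment (written once with <- and once with =) against one that ends with the frame itself.

```r
# Hypothetical helper functions, named only for this illustration.
ends_with_arrow <- function() {
  df <- data.frame(a = 1:3)
  df$id <- letters[1:3]        # last expression: the function returns the assigned vector
}

ends_with_equals <- function() {
  df = data.frame(a = 1:3)
  df$id = letters[1:3]         # same behavior with =
}

ends_with_frame <- function() {
  df <- data.frame(a = 1:3)
  df$id <- letters[1:3]
  df                           # last expression: the function returns the whole frame
}

res_arrow  <- ends_with_arrow()
res_equals <- ends_with_equals()
res_frame  <- ends_with_frame()

print(res_arrow)
# [1] "a" "b" "c"
identical(res_arrow, res_equals)   # TRUE
is.data.frame(res_frame)           # TRUE
```

The value of an assignment is returned invisibly, which is why the results are printed explicitly here.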
