Home > Software design >  Subset rows in multiple tsv files based on column values; combine results in one df; add one column
Subset rows in multiple tsv files based on column values; combine results in one df; add one column

Time:01-02

I have >300 tsv files with the same dimensions. Column headers are the same for all files. I would like to filter each of these files to include only rows which satisfy criteria specific to one column. For example, I would want to include any rows where 'ColumnA' = 'gene1' or 'gene2' or 'gene3'. I would also like to combine the subsetted data from all files into 1 df and add an additional column so that I know which file each row came from. I am fairly new to R so I don't have experience with writing script to be applied to multiple files.

I have used the filter function before but how can this be applied to more than 1 file? Should I merge the data in all files and then filter? If so, then how do I add an additional column to each file for the file name before merging all of the files? I read about for-loops but when I tried this I got an error message: Error in file(file, "rt") : invalid 'description' argument. Any help would be greatly appreciated!

for(i in M_Data){
    dat <- read_tsv (paste("/Users/EM/Desktop/Files",i,sep="\t"))
    dat$Source <- i
    M_Data_Merge <- rbind(M_Data_Merge,dat)
  }

CodePudding user response:

data.table is your friend for this task:

 library(data.table)

 fileNames = dir("path/to/files", pattern = "*.tsv", full.names = TRUE)
 listOfTables = lapply(fileNames, fread)

 # convert all tables to a big df
 df = rbindlist(listOfTables)

 # filter only the rows you want:
 newDf = df[ColumnA %chin% c("gene1", "gene2", "gene3"), ]
  • Related