I have >300 tsv files with the same dimensions, and the column headers are the same for all files. I would like to filter each of these files to include only rows that satisfy a criterion on one column: for example, keep any rows where 'ColumnA' is 'gene1', 'gene2', or 'gene3'. I would also like to combine the subsetted data from all files into one data frame, adding an extra column so that I know which file each row came from. I am fairly new to R, so I don't have experience writing a script that is applied to multiple files.
I have used the filter function before, but how can it be applied to more than one file? Should I merge the data from all files and then filter? If so, how do I add a column with the file name to each file before merging them all? I read about for-loops, but when I tried one I got an error message: Error in file(file, "rt") : invalid 'description' argument. Any help would be greatly appreciated!
for (i in M_Data) {
  dat <- read_tsv(paste("/Users/EM/Desktop/Files", i, sep = "\t"))
  dat$Source <- i
  M_Data_Merge <- rbind(M_Data_Merge, dat)
}
CodePudding user response:
data.table is your friend for this task. (As an aside, your loop fails because for (i in M_Data) iterates over the columns of M_Data rather than over file names, and paste(..., sep = "\t") builds an invalid path; that malformed path is what triggers the invalid 'description' error in file().)
library(data.table)
# list all .tsv files in the folder (note: pattern is a regex, not a glob)
fileNames = dir("path/to/files", pattern = "\\.tsv$", full.names = TRUE)
listOfTables = lapply(fileNames, fread)
names(listOfTables) = basename(fileNames)
# stack all tables into one big data.table; idcol adds a column
# recording which file each row came from
df = rbindlist(listOfTables, idcol = "Source")
# filter only the rows you want:
newDf = df[ColumnA %chin% c("gene1", "gene2", "gene3"), ]
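If you would rather stay with the filter() workflow you already know, the same pipeline can be written with readr, purrr, and dplyr. A minimal self-contained sketch, assuming those packages are installed — the demo_tsv folder, the two demo files, and the Value column are invented for illustration; in practice you would point fileNames at your own folder:

```r
library(dplyr)
library(readr)
library(purrr)

# Create two tiny demo .tsv files so the sketch runs on its own
tmp <- file.path(tempdir(), "demo_tsv")
dir.create(tmp, showWarnings = FALSE)
write_tsv(tibble(ColumnA = c("gene1", "gene9"), Value = c(1, 2)),
          file.path(tmp, "a.tsv"))
write_tsv(tibble(ColumnA = c("gene3", "gene8"), Value = c(3, 4)),
          file.path(tmp, "b.tsv"))

fileNames <- list.files(tmp, pattern = "\\.tsv$", full.names = TRUE)

# Read every file, stack the rows, tag each row with its source file
# via bind_rows(.id = ...), then filter on ColumnA
combined <- fileNames |>
  set_names(basename(fileNames)) |>
  map(read_tsv, show_col_types = FALSE) |>
  bind_rows(.id = "Source") |>
  filter(ColumnA %in% c("gene1", "gene2", "gene3"))
```

The key step is naming the list before bind_rows(): the .id argument then fills the new Source column with those names, so you never have to touch each file individually.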