Home > Software engineering >  Load specific columns of all the files inside a folder, and merge them in the same dataframe
Load specific columns of all the files inside a folder, and merge them in the same dataframe

Time:06-13

I have 100 files of a format called .out.tab.tpm that correspond to 100 patients.

This is a .txt file that can be read normally with fread function, and that looks like this:

gene_id    tpm      count    V4
ENG2023    234      4        152
ENG2048    0        0        1
ENG2087    2        65       6

gene_id is the column that specifies the gene (gene id), and all the files have the same values for this column. tpm is the measurement I'm interested in.

What I want is to load each of these files, getting the gene_id and tpm column, and merging by gene_id. And then tranpose, so the patients will be the row names, and the gene the column names.

Something like the following:

PatientID   ENG2023   ENG2048   ENG2087
Patient1    234       0         2
Patient2    123       0         34
Patient3    36        2         1
Patient4    89        0         6
Patient5    456       1         12

You can do this for two files manually with:

df1 <- fread("C:/Users/Data/Patient1_ReadsPerGene.out.tab.tpm", select = c("gene_id", "tpm"))
df2 <- fread("C:/Users/Data/Patient2_ReadsPerGene.out.tab.tpm", select = c("gene_id", "tpm"))

df3 <- merge(df1, df2, by="gene_id")

I tried creating a loop to do this:

setwd("C:/Users/")

#create a list of the files from your target directory
file_list <- list.files(path="C:/Users/Data/")

#initiate a blank data frame, each iteration of the loop will append the data from the given file to this variable
dataset <- data.frame()

for (i in 1:length(file_list)){
  temp_data <- fread(file_list[i], select = c("gene_id", "tpm")) 
  dataset <- merge(dataset, temp_data, by="gene_id") 
}

However, I get the following error:

Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column

emphasized text

CodePudding user response:

I propose this:

library(data.table)

patients <- c("patient1", "patienta", "patienttoto")

all.files <- paste0(patients, ".out.tab.tpm")

l <- lapply(all.files, fread)
names(l) <- patients

for (n in 1:length(patients)) {
  l[[n]][, PatientID := patients[n]]
}

dt <- rbindlist(l)
dcast(dt, PatientID ~ gene_id, value.var="tpm")
  • Related