I have 100 files of a format called .out.tab.tpm
that correspond to 100 patients.
This is a .txt file that can be read normally with fread
function, and that looks like this:
gene_id tpm count V4
ENG2023 234 4 152
ENG2048 0 0 1
ENG2087 2 65 6
gene_id
is the column that specifies the gene (gene id), and all the files have the same values for this column.
tpm
is the measurement I'm interested in.
What I want is to load each of these files, getting the gene_id
and tpm
column, and merging by gene_id
. And then tranpose, so the patients will be the row names, and the gene the column names.
Something like the following:
PatientID ENG2023 ENG2048 ENG2087
Patient1 234 0 2
Patient2 123 0 34
Patient3 36 2 1
Patient4 89 0 6
Patient5 456 1 12
You can do this for two files manually with:
df1 <- fread("C:/Users/Data/Patient1_ReadsPerGene.out.tab.tpm", select = c("gene_id", "tpm"))
df2 <- fread("C:/Users/Data/Patient2_ReadsPerGene.out.tab.tpm", select = c("gene_id", "tpm"))
df3 <- merge(df1, df2, by="gene_id")
I tried creating a loop to do this:
setwd("C:/Users/")
#create a list of the files from your target directory
file_list <- list.files(path="C:/Users/Data/")
#initiate a blank data frame, each iteration of the loop will append the data from the given file to this variable
dataset <- data.frame()
for (i in 1:length(file_list)){
temp_data <- fread(file_list[i], select = c("gene_id", "tpm"))
dataset <- merge(dataset, temp_data, by="gene_id")
}
However, I get the following error:
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
emphasized text
CodePudding user response:
I propose this:
library(data.table)
patients <- c("patient1", "patienta", "patienttoto")
all.files <- paste0(patients, ".out.tab.tpm")
l <- lapply(all.files, fread)
names(l) <- patients
for (n in 1:length(patients)) {
l[[n]][, PatientID := patients[n]]
}
dt <- rbindlist(l)
dcast(dt, PatientID ~ gene_id, value.var="tpm")