I have a directory with several files tab-delimited (txt format). Each file is a table with a lot of rows, but I am interested in the 10th column. I want to extract all uniq values of this column and count the occurrence for all files. For this purpose, I have used the code below in bash, which works, but when the files are extremely heavy, it does not respond.
awk -F'\t' 'FNR==1{next}
{
n[$10];
if ($10 in a) {
a[$10]=a[$10]","ARGV[ARGIND]
}else{
a[$10]=ARGV[ARGIND]
}
}
END{
printf("%-24s %6s %s\n","Variant","Nº of repeats","Location");
for (v in n) printf("%-24s m %s\n",v,n[v],a[v])}' * > var_freq.txt
The output should see something like this:
Variant Nº of repeats Location
variant1 3 file1,file2,file3
variant2 5 file1,file3,file4,file5,file7
variant3 2 file1,file4
The header of a possible file looks like this:
Chr Start END COL4 COL5 COL6 COL7 COL8 COL9 Variant
1 1234 1345 ABC 123 234 345 456 567 c.1236A>G
Anyone could tell me how to do that in other languages that can handle BigData? (python, R...) Thanks
CodePudding user response:
In R
Load these packages:
library(dplyr)
library(purrr)
library(tibble)
Then, given that you have a vector files
with your file names, you can find the counts of the variants with:
files %>%
purrr::map_dfr(
\(file) {
read.delim(file) %>%
select(10) %>%
table() %>%
enframe(name = "Variant", value = "N") %>%
mutate(Location = file)
}
) %>%
group_by(Variant) %>%
summarize(`N° of repeats` = sum(N),
Location = paste0(Location, collapse = ","),
.groups = "drop")
You can install the dependencies from the R console with:
install.packages(c("dplyr", "purrr", "tibble"))
CodePudding user response:
Here is another option in R:
library(data.table)
file_list <- list.files(pattern = "\\.txt")
myData <- lapply(file_list, \(x){
dat <- fread(x, select = 10) |>
table() |>
data.table()
dat[,file := sub("\\.txt", "", x)]
}) |>
rbindlist()
final <- myData[, .(`Nº of repeats` = sum(N), Location = paste(file, collapse = ",")),
by = list(Variant = V1)]
print(final)
CodePudding user response:
In base R
mydir <- "/location/of/my/dir"
files <- list.files(filedir, pattern = "*.txt", full.names = TRUE)
variants <- lapply(seq_along(files), \(i) data.frame(
"Location" = basename(files[i]),
"Variant" = unique(read.delim(files[i])[,10])
))
output <- aggregate(Location~Variant, do.call(rbind, variants), paste, sep=",")
output$Nrepeats <- nchar(gsub("[^,]", "", output$Location)) 1
> output
Variant Location Nrepeats
1 variant1 file1.csv, file3.csv 2
2 variant2 file1.csv, file2.csv 2
3 variant3 file1.csv 1
4 variant4 file1.csv, file2.csv, file3.csv 3
5 variant5 file2.csv 1
6 variant6 file2.csv, file3.csv 2
7 variant9 file3.csv 1
- Create a vector of filenames
- For each filename, create a data frame like this:
> variants[[1]]
Location Variant
1 file1.csv variant1
2 file1.csv variant2
3 file1.csv variant3
4 file1.csv variant4
- Stick all the data frames together and aggregate by Variant (you're left with one row per variant)
- Count the number of filenames to populate Nrepeats (this is a bit inefficient and you could use two separate aggregations or tapplys instead)