Get the frequency of the elements in a column


I have a directory with several tab-delimited files (txt format). Each file is a table with many rows, and I am interested in the 10th column. I want to extract all the unique values of this column and count their occurrences across all files. For this purpose I have used the code below in bash, which works, but when the files are very large it stops responding.

awk -F'\t' 'FNR==1{next}                    # skip the header row of every file
{
  n[$10]++                                  # count occurrences of this variant
  f=ARGV[ARGIND]                            # name of the file being read
  if (!(($10,f) in seen)) {                 # list each file only once per variant
    seen[$10,f]
    a[$10]=($10 in a) ? a[$10] "," f : f
  }
}
END{
printf("%-24s %6s    %s\n","Variant","Nº of repeats","Location");
for (v in n) printf("%-24s %6d    %s\n",v,n[v],a[v])}' * > var_freq.txt

The output should look something like this:

Variant     Nº of repeats     Location
variant1    3     file1,file2,file3
variant2    5     file1,file3,file4,file5,file7
variant3    2     file1,file4

The header of a possible file looks like this:

Chr     Start     END    COL4     COL5     COL6     COL7     COL8     COL9     Variant
1     1234     1345    ABC     123     234     345     456     567     c.1236A>G

Could anyone tell me how to do this in another language that can handle big data (Python, R, ...)? Thanks

CodePudding user response:

In R

Load these packages:

library(dplyr)
library(purrr)
library(tibble)

Then, given that you have a vector files containing your file names, you can find the counts of the variants with:

files %>%
  purrr::map_dfr(
    \(file) {
      read.delim(file) %>%
        select(10) %>%                              # keep only the 10th column
        table() %>%                                 # count occurrences within this file
        enframe(name = "Variant", value = "N") %>%  # -> tibble with Variant and N
        mutate(Location = file)                     # tag the counts with their source file
    }
  ) %>%
  group_by(Variant) %>%
  summarize(`Nº of repeats` = sum(N),
            Location = paste0(Location, collapse = ","),
            .groups = "drop")
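If you do not already have the files vector, a minimal sketch for building it, assuming all the .txt files sit in a single (placeholder) directory:

files <- list.files("/location/of/my/dir", pattern = "\\.txt$", full.names = TRUE)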

You can install the dependencies from the R console with:

install.packages(c("dplyr", "purrr", "tibble"))

CodePudding user response:

Here is another option in R:

library(data.table)

file_list <- list.files(pattern = "\\.txt")

myData <- lapply(file_list, \(x){
  dat <- fread(x, select = 10) |>       # read only the 10th column
    table() |>                          # count occurrences within this file
    data.table()                        # -> columns V1 (variant) and N (count)
  dat[, file := sub("\\.txt", "", x)]   # tag each row with its source file
}) |>
  rbindlist()

final <- myData[, .(`Nº of repeats` = sum(N),
                    Location = paste(file, collapse = ",")),
                by = list(Variant = V1)]

print(final)
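If you want the result on disk like the original var_freq.txt, a minimal sketch using data.table's fwrite (tab-separated output is an assumption here):

fwrite(final, "var_freq.txt", sep = "\t")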

CodePudding user response:

In base R

mydir <- "/location/of/my/dir"
files <- list.files(mydir, pattern = "\\.txt$", full.names = TRUE)

variants <- lapply(seq_along(files), \(i) data.frame(
  "Location" = basename(files[i]),
  "Variant" = unique(read.delim(files[i])[,10])
))

output <- aggregate(Location ~ Variant, do.call(rbind, variants), paste, collapse = ", ")
output$Nrepeats <- nchar(gsub("[^,]", "", output$Location)) + 1

> output
   Variant                        Location Nrepeats
1 variant1            file1.csv, file3.csv        2
2 variant2            file1.csv, file2.csv        2
3 variant3                       file1.csv        1
4 variant4 file1.csv, file2.csv, file3.csv        3
5 variant5                       file2.csv        1
6 variant6            file2.csv, file3.csv        2
7 variant9                       file3.csv        1
  1. Create a vector of filenames
  2. For each filename, create a data frame like this:
> variants[[1]]
   Location  Variant
1 file1.csv variant1
2 file1.csv variant2
3 file1.csv variant3
4 file1.csv variant4
  3. Stick all the data frames together and aggregate by Variant (you're left with one row per variant)
  4. Count the number of filenames to populate Nrepeats (this is a bit inefficient; you could use two separate aggregations or tapplys instead, as in the sketch below)
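For example, a sketch of the two-aggregation variant mentioned in step 4, reusing the variants list built above:

# bind the per-file data frames into one
stacked <- do.call(rbind, variants)
# first aggregation: collapse the file names per variant
output <- aggregate(Location ~ Variant, stacked, paste, collapse = ", ")
# second aggregation: count the files per variant
# (rows align because both calls sort by Variant)
output$Nrepeats <- aggregate(Location ~ Variant, stacked, length)$Location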