Get the frequency of the elements in a column


I have a directory with several tab-delimited files (txt format). Each file is a table with many rows, and I am interested in the 10th column. I want to extract all the unique values of this column and count their occurrences across all files. For this purpose I have used the code below in bash, which works, but when the files are very large it stops responding.

awk -F'\t' 'FNR==1{next}                    # skip the header row of every file
{
  n[$10]++                                  # count occurrences of this variant
  f=ARGV[ARGIND]                            # name of the file being read
  if (!(($10,f) in seen)) {                 # list each file only once per variant
    seen[$10,f]
    a[$10]=($10 in a) ? a[$10] "," f : f
  }
}
END{
printf("%-24s %6s    %s\n","Variant","Nº of repeats","Location");
for (v in n) printf("%-24s %6d    %s\n",v,n[v],a[v])}' * > var_freq.txt

The output should look something like this:

Variant     Nº of repeats     Location
variant1    3     file1,file2,file3
variant2    5     file1,file3,file4,file5,file7
variant3    2     file1,file4

The header of a possible file looks like this:

Chr     Start     END    COL4     COL5     COL6     COL7     COL8     COL9     Variant
1     1234     1345    ABC     123     234     345     456     567     c.1236A>G

Could anyone tell me how to do this in another language that can handle big data (Python, R, ...)? Thanks

CodePudding user response:

In R

Load these packages:

library(dplyr)
library(purrr)
library(tibble)

Then, given that you have a vector files containing your file names, you can find the counts of the variants with:

files %>%
  purrr::map_dfr(
    \(file) {
      read.delim(file) %>%
        select(10) %>%                              # keep only the 10th column
        table() %>%                                 # count occurrences within this file
        enframe(name = "Variant", value = "N") %>%  # -> tibble with Variant and N
        mutate(Location = file)                     # tag the counts with their source file
    }
  ) %>%
  group_by(Variant) %>%
  summarize(`Nº of repeats` = sum(N),
            Location = paste0(Location, collapse = ","),
            .groups = "drop")
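If you do not already have the files vector, a minimal sketch for building it, assuming all the .txt files sit in a single (placeholder) directory:

files <- list.files("/location/of/my/dir", pattern = "\\.txt$", full.names = TRUE)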

You can install the dependencies from the R console with:

install.packages(c("dplyr", "purrr", "tibble"))

CodePudding user response:

Here is another option in R:

library(data.table)

file_list <- list.files(pattern = "\\.txt")

myData <- lapply(file_list, \(x){
  dat <- fread(x, select = 10) |>       # read only the 10th column
    table() |>                          # count occurrences within this file
    data.table()                        # -> columns V1 (variant) and N (count)
  dat[, file := sub("\\.txt", "", x)]   # tag each row with its source file
}) |>
  rbindlist()

final <- myData[, .(`Nº of repeats` = sum(N),
                    Location = paste(file, collapse = ",")),
                by = list(Variant = V1)]

print(final)
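If you want the result on disk like the original var_freq.txt, a minimal sketch using data.table's fwrite (tab-separated output is an assumption here):

fwrite(final, "var_freq.txt", sep = "\t")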

CodePudding user response:

In base R

mydir <- "/location/of/my/dir"
files <- list.files(mydir, pattern = "\\.txt$", full.names = TRUE)

variants <- lapply(seq_along(files), \(i) data.frame(
  "Location" = basename(files[i]),
  "Variant" = unique(read.delim(files[i])[,10])
))

output <- aggregate(Location ~ Variant, do.call(rbind, variants), paste, collapse = ", ")
output$Nrepeats <- nchar(gsub("[^,]", "", output$Location)) + 1

> output
   Variant                        Location Nrepeats
1 variant1            file1.csv, file3.csv        2
2 variant2            file1.csv, file2.csv        2
3 variant3                       file1.csv        1
4 variant4 file1.csv, file2.csv, file3.csv        3
5 variant5                       file2.csv        1
6 variant6            file2.csv, file3.csv        2
7 variant9                       file3.csv        1
  1. Create a vector of filenames
  2. For each filename, create a data frame like this:
> variants[[1]]
   Location  Variant
1 file1.csv variant1
2 file1.csv variant2
3 file1.csv variant3
4 file1.csv variant4
  3. Stick all the data frames together and aggregate by Variant (you're left with one row per variant)
  4. Count the number of filenames to populate Nrepeats (this is a bit inefficient; you could use two separate aggregations or tapplys instead, as in the sketch below)
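For example, a sketch of the two-aggregation variant mentioned in step 4, reusing the variants list built above:

# bind the per-file data frames into one
stacked <- do.call(rbind, variants)
# first aggregation: collapse the file names per variant
output <- aggregate(Location ~ Variant, stacked, paste, collapse = ", ")
# second aggregation: count the files per variant
# (rows align because both calls sort by Variant)
output$Nrepeats <- aggregate(Location ~ Variant, stacked, length)$Location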