AWK looping through columns to count matches


I have a tab-delimited file that looks like this:

SampleID dbSNP Min.alle M.zygo Sample1 Sample2 Sample3
311 rs1490413 A Homo G A G
311 rs730123 G Homo A G A
311 rs7532151 A Homo A C C
311 rs1434369 G Homo T G T
311 rs1563172 T Homo T C C

The number of samples, and thus the number of sample columns from $5 onwards, is theoretically unlimited.

I want to count, for each sample column, how many rows have the same letter in $3 and in that column ($3 vs $5, then $3 vs $6, then $3 vs $7, and so on), and divide each count by the total number of rows (excluding the header).

So far I can do this separately for each pair, e.g. for $3 and $5, like this:

awk 'BEGIN {
    FS = OFS = "\t"
}

$3 == $5 {
    count++
}

END {
    print count/(NR-1)
}
' infile
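
For reference, that pair-wise script can also be pointed at any one sample column by passing the column number in on the command line (a sketch; "col" is just an illustrative variable name), but it still handles only one column per run:

awk -v col=5 '
BEGIN { FS = OFS = "\t" }
NR > 1 && $3 == $col { count++ }    # compare Min.alle with the chosen sample column
END { print count / (NR - 1) }
' infile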

I would like to do it in a loop and get output like this (e.g. Sample1 matches Min.alle in 2 of the 5 data rows, hence 0.4):

SampleID Sample1 Sample2 Sample3
311 0.4 0.6 0

Can someone help, please?

CodePudding user response:

Here is one potential solution:

awk '
BEGIN{
    OFS="\t"; printf "%s\t", "SampleID"
}
NR==1{
    # header row: print a generated name for each sample column
    for(i=5;i<=NF;i++){
        printf "Sample%s\t", (i-4)
    }
}
NR>1{
    sample[$1]++                  # rows seen for this SampleID
    !sampleID[$1]++               # remember this SampleID as a key for the END loop
    for(i=5;i<=NF;i++){
        if($3 == $i){
            count[$1, i]++        # matches between Min.alle ($3) and sample column i
        }
    }
}
END{
    for (j in sampleID) {
        print ""
        printf "%s\t", j
        for(i=5;i<=NF;i++){
            printf "%s\t", count[j, i] / sample[j]
        }
    }
}' inputfile

SampleID  Sample1  Sample2  Sample3
311       0.4      0.6      0
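
One small caveat about the END block: it reuses NF, which in awk still holds the field count of the last record read. That is fine here because every row has the same number of columns, but if you want to guard against a ragged last line, a minimal variant of the same logic (just remembering the header's field count in a variable, here called n) could look like this:

awk '
BEGIN { OFS = "\t"; printf "%s\t", "SampleID" }
NR == 1 { n = NF; for (i = 5; i <= n; i++) printf "Sample%s\t", (i - 4) }      # n: header field count
NR > 1  { sample[$1]++; for (i = 5; i <= n; i++) if ($3 == $i) count[$1, i]++ }
END {
    for (j in sample) {
        print ""
        printf "%s\t", j
        for (i = 5; i <= n; i++) printf "%s\t", count[j, i] / sample[j]
    }
}' inputfile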

Rather than dividing by (NR-1), the values are divided by the number of rows belonging to each SampleID. So, if you have other SampleIDs in the file:

cat test.txt
SampleID    dbSNP   Min.alle    M.zygo  Sample1 Sample2 Sample3
311 rs1490413   A   Homo    G   A   G
311 rs730123    G   Homo    A   G   A
311 rs7532151   A   Homo    A   C   C
311 rs1434369   G   Homo    T   G   T
311 rs1563172   T   Homo    T   C   C
312 rs1490413   A   Homo    G   A   G
312 rs730123    G   Homo    A   G   A
312 rs7532151   A   Homo    A   C   C
312 rs1434369   G   Homo    T   G   T
312 rs1563172   G   Homo    T   C   C

awk '
BEGIN{
    OFS="\t"; printf "%s\t", "SampleID"
}
NR==1{
    for(i=5;i<=NF;i++){
        printf "Sample%s\t", (i-4)
    }
}
NR>1{
    sample[$1]++
    !sampleID[$1]++
    for(i=5;i<=NF;i++){
        if($3 == $i){
            count[$1, i]++
        }
    }
}
END{
    for (j in sampleID) {
        print ""
        printf "%s\t", j
        for(i=5;i<=NF;i++){
            printf "%s\t", count[j, i] / sample[j]
        }
    }
}' test.txt
SampleID    Sample1 Sample2 Sample3
311         0.4     0.6     0
312         0.2     0.6     0
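
One detail to be aware of: "for (j in sampleID)" visits the array keys in an unspecified order, so with several SampleIDs the output rows are not guaranteed to come out sorted in every awk implementation. If you happen to be using GNU awk, you can force ascending numeric order by setting PROCINFO["sorted_in"] at the top of the END block (gawk-specific, not POSIX awk); a drop-in replacement for the END block above:

END{
    PROCINFO["sorted_in"] = "@ind_num_asc"    # gawk only: iterate keys in ascending numeric order
    for (j in sampleID) {
        print ""
        printf "%s\t", j
        for(i=5;i<=NF;i++){
            printf "%s\t", count[j, i] / sample[j]
        }
    }
}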

Depending on the size of the file(s), it might be worth considering other languages for this task, e.g. it is relatively trivial in R:

library(dplyr)

df <- read.table(text = "SampleID   dbSNP   Min.alle    M.zygo  Sample1 Sample2 Sample3
311 rs1490413   A   Homo    G   A   G
311 rs730123    G   Homo    A   G   A
311 rs7532151   A   Homo    A   C   C
311 rs1434369   G   Homo    T   G   T
311 rs1563172   T   Homo    T   C   C", header = TRUE)

df %>%
  group_by(SampleID) %>%
  summarise(across(starts_with("Sample"), ~mean(.x == Min.alle)))
#> # A tibble: 1 × 4
#>   SampleID Sample1 Sample2 Sample3
#>      <int>   <dbl>   <dbl>   <dbl>
#> 1      311     0.4     0.6       0

Edit

To print the actual column names from the header (instead of the generated "Sample1", "Sample2", ...) you could use:

awk '
BEGIN{
    OFS="\t"; printf "%s\t", "SampleID"
}
NR==1{
    for(i=5;i<=NF;i++){
        printf "%s\t", $i
    }
}
NR>1{
    sample[$1]++
    !sampleID[$1]++
    for(i=5;i<=NF;i++){
        if($3 == $i){
            count[$1, i]++
        }
    }
}
END{
    for (j in sampleID) {
        print ""
        printf "%s\t", j
        for(i=5;i<=NF;i++){
            printf "%s\t", count[j, i] / sample[j]
        }
    }
}' inputfile
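
As a quick sanity check on any single value, you can still count one SampleID/sample pair directly, in the spirit of the pair-wise attempt in the question. For example, to confirm the 0.4 reported for SampleID 311 and Sample1 in test.txt above (this relies on the same default field splitting as the scripts above):

awk '
$1 == "311"             { n++ }   # rows belonging to SampleID 311
$1 == "311" && $3 == $5 { c++ }   # rows where Min.alle ($3) matches Sample1 ($5)
END { print c "/" n " = " c / n } # expect 2/5 = 0.4
' test.txt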