AWK looping through columns to count matches


I have a tab-delimited file that looks like this:

SampleID dbSNP Min.alle M.zygo Sample1 Sample2 Sample3
311 rs1490413 A Homo G A G
311 rs730123 G Homo A G A
311 rs7532151 A Homo A C C
311 rs1434369 G Homo T G T
311 rs1563172 T Homo T C C

The number of samples, and thus the number of sample columns from $5 onwards, is theoretically unlimited.

I want to count, for each sample column, how many rows have the same letter in $3 and in that column ($3 vs $5, then $3 vs $6, then $3 vs $7, and so on), and divide each count by the total number of rows (excluding the header).

So far I can do this separately for each pair, e.g. for $3 and $5, like this:

awk 'BEGIN {
    FS = OFS = "\t"
}

$3 == $5 {
    count++
}

END {
    print count/(NR-1)
}
' infile
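
For reference, that pair-wise script can also be pointed at any one sample column by passing the column number in on the command line (a sketch; "col" is just an illustrative variable name), but it still handles only one column per run:

awk -v col=5 '
BEGIN { FS = OFS = "\t" }
NR > 1 && $3 == $col { count++ }    # compare Min.alle with the chosen sample column
END { print count / (NR - 1) }
' infile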

I would like to do it in a loop and get output like this (e.g. Sample1 matches Min.alle in 2 of the 5 data rows, hence 0.4):

SampleID Sample1 Sample2 Sample3
311 0.4 0.6 0

Can someone help, please?

CodePudding user response:

Here is one potential solution:

awk '
BEGIN{
    OFS="\t"; printf "%s\t", "SampleID"
}
NR==1{
    # header row: print a generated name for each sample column
    for(i=5;i<=NF;i++){
        printf "Sample%s\t", (i-4)
    }
}
NR>1{
    sample[$1]++                  # rows seen for this SampleID
    !sampleID[$1]++               # remember this SampleID as a key for the END loop
    for(i=5;i<=NF;i++){
        if($3 == $i){
            count[$1, i]++        # matches between Min.alle ($3) and sample column i
        }
    }
}
END{
    for (j in sampleID) {
        print ""
        printf "%s\t", j
        for(i=5;i<=NF;i++){
            printf "%s\t", count[j, i] / sample[j]
        }
    }
}' inputfile

SampleID  Sample1  Sample2  Sample3
311       0.4      0.6      0
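
One small caveat about the END block: it reuses NF, which in awk still holds the field count of the last record read. That is fine here because every row has the same number of columns, but if you want to guard against a ragged last line, a minimal variant of the same logic (just remembering the header's field count in a variable, here called n) could look like this:

awk '
BEGIN { OFS = "\t"; printf "%s\t", "SampleID" }
NR == 1 { n = NF; for (i = 5; i <= n; i++) printf "Sample%s\t", (i - 4) }      # n: header field count
NR > 1  { sample[$1]++; for (i = 5; i <= n; i++) if ($3 == $i) count[$1, i]++ }
END {
    for (j in sample) {
        print ""
        printf "%s\t", j
        for (i = 5; i <= n; i++) printf "%s\t", count[j, i] / sample[j]
    }
}' inputfile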

Rather than dividing by (NR-1), the values are divided by the number of rows belonging to each SampleID. So, if you have other SampleIDs in the file:

cat test.txt
SampleID    dbSNP   Min.alle    M.zygo  Sample1 Sample2 Sample3
311 rs1490413   A   Homo    G   A   G
311 rs730123    G   Homo    A   G   A
311 rs7532151   A   Homo    A   C   C
311 rs1434369   G   Homo    T   G   T
311 rs1563172   T   Homo    T   C   C
312 rs1490413   A   Homo    G   A   G
312 rs730123    G   Homo    A   G   A
312 rs7532151   A   Homo    A   C   C
312 rs1434369   G   Homo    T   G   T
312 rs1563172   G   Homo    T   C   C

awk '
BEGIN{
    OFS="\t"; printf "%s\t", "SampleID"
}
NR==1{
    for(i=5;i<=NF;i++){
        printf "Sample%s\t", (i-4)
    }
}
NR>1{
    sample[$1]++
    !sampleID[$1]++
    for(i=5;i<=NF;i++){
        if($3 == $i){
            count[$1, i]++
        }
    }
}
END{
    for (j in sampleID) {
        print ""
        printf "%s\t", j
        for(i=5;i<=NF;i++){
            printf "%s\t", count[j, i] / sample[j]
        }
    }
}' test.txt
SampleID    Sample1 Sample2 Sample3
311         0.4     0.6     0
312         0.2     0.6     0
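
One detail to be aware of: "for (j in sampleID)" visits the array keys in an unspecified order, so with several SampleIDs the output rows are not guaranteed to come out sorted in every awk implementation. If you happen to be using GNU awk, you can force ascending numeric order by setting PROCINFO["sorted_in"] at the top of the END block (gawk-specific, not POSIX awk); a drop-in replacement for the END block above:

END{
    PROCINFO["sorted_in"] = "@ind_num_asc"    # gawk only: iterate keys in ascending numeric order
    for (j in sampleID) {
        print ""
        printf "%s\t", j
        for(i=5;i<=NF;i++){
            printf "%s\t", count[j, i] / sample[j]
        }
    }
}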

Depending on the size of the file(s), it might be worth considering other languages for this task, e.g. it is relatively trivial in R:

library(dplyr)

df <- read.table(text = "SampleID   dbSNP   Min.alle    M.zygo  Sample1 Sample2 Sample3
311 rs1490413   A   Homo    G   A   G
311 rs730123    G   Homo    A   G   A
311 rs7532151   A   Homo    A   C   C
311 rs1434369   G   Homo    T   G   T
311 rs1563172   T   Homo    T   C   C", header = TRUE)

df %>%
  group_by(SampleID) %>%
  summarise(across(starts_with("Sample"), ~mean(.x == Min.alle)))
#> # A tibble: 1 × 4
#>   SampleID Sample1 Sample2 Sample3
#>      <int>   <dbl>   <dbl>   <dbl>
#> 1      311     0.4     0.6       0

Edit

To print the actual column names from the header (instead of the generated "Sample1", "Sample2", ...) you could use:

awk '
BEGIN{
    OFS="\t"; printf "%s\t", "SampleID"
}
NR==1{
    for(i=5;i<=NF;i++){
        printf "%s\t", $i
    }
}
NR>1{
    sample[$1]++
    !sampleID[$1]++
    for(i=5;i<=NF;i++){
        if($3 == $i){
            count[$1, i]++
        }
    }
}
END{
    for (j in sampleID) {
        print ""
        printf "%s\t", j
        for(i=5;i<=NF;i++){
            printf "%s\t", count[j, i] / sample[j]
        }
    }
}' inputfile
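
As a quick sanity check on any single value, you can still count one SampleID/sample pair directly, in the spirit of the pair-wise attempt in the question. For example, to confirm the 0.4 reported for SampleID 311 and Sample1 in test.txt above (this relies on the same default field splitting as the scripts above):

awk '
$1 == "311"             { n++ }   # rows belonging to SampleID 311
$1 == "311" && $3 == $5 { c++ }   # rows where Min.alle ($3) matches Sample1 ($5)
END { print c "/" n " = " c / n } # expect 2/5 = 0.4
' test.txt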