I have a tab-delimited file that looks like this:
SampleID | dbSNP | Min.alle | M.zygo | Sample1 | Sample2 | Sample3 |
---|---|---|---|---|---|---|
311 | rs1490413 | A | Homo | G | A | G |
311 | rs730123 | G | Homo | A | G | A |
311 | rs7532151 | A | Homo | A | C | C |
311 | rs1434369 | G | Homo | T | G | T |
311 | rs1563172 | T | Homo | T | C | C |
the number of samples and thus columns $5-$i is theoretically unlimited.
I want to count the total number of occurrences of the same letter in the same row between $3 with $5 and then with $6 and then with $7 etc. and divide the resulting values by the total number of rows (except header)
so far, I could do it separately for each pair e.g. for $3 and $5 like this:
awk 'BEGIN {
FS = OFS = "\t"
}
$3 == $5 {
count
}
END {
print count/(NR-1)
}
' infile
I would like to do it in a loop and get an output like this:
SampleID | Sample1 | Sample2 | Sample3 |
---|---|---|---|
311 | 0.4 | 0.6 | 0 |
Can someone help, please?
CodePudding user response:
Here is one potential solution:
awk '
BEGIN{
OFS="\t"; printf "%s\t", "SampleID"
}
NR==1{
for(i=5;i<=NF;i ){
printf "Sample%s\t", (i-4)
}
}
NR>1{
sample[$1]
!sampleID[$1]
for(i=5;i<=NF;i ){
if($3 == $i){
count[$1, i]
}
}
}
END{
for (j in sampleID) {
print ""
printf "%s\t", j
for(i=5;i<=NF;i ){
printf "%s\t", count[j, i] / sample[j]
}
}
}' inputfile
SampleID Sample1 Sample2 Sample3
311 0.4 0.6 0
Rather than divide by (NR-1), the values are divided by the number of SampleID rows. So, if you have other sampleIDs in the file:
cat test.txt
SampleID dbSNP Min.alle M.zygo Sample1 Sample2 Sample3
311 rs1490413 A Homo G A G
311 rs730123 G Homo A G A
311 rs7532151 A Homo A C C
311 rs1434369 G Homo T G T
311 rs1563172 T Homo T C C
312 rs1490413 A Homo G A G
312 rs730123 G Homo A G A
312 rs7532151 A Homo A C C
312 rs1434369 G Homo T G T
312 rs1563172 G Homo T C C
awk '
BEGIN{
OFS="\t"; printf "%s\t", "SampleID"
}
NR==1{
for(i=5;i<=NF;i ){
printf "Sample%s\t", (i-4)
}
}
NR>1{
sample[$1]
!sampleID[$1]
for(i=5;i<=NF;i ){
if($3 == $i){
count[$1, i]
}
}
}
END{
for (j in sampleID) {
print ""
printf "%s\t", j
for(i=5;i<=NF;i ){
printf "%s\t", count[j, i] / sample[j]
}
}
}' test.txt
SampleID Sample1 Sample2 Sample3
311 0.4 0.6 0
312 0.2 0.6 0
Depending on the size of the file/s it might be worth looking at other languages for this task, i.e. this is relatively trivial in R:
library(dplyr)
df <- read.table(text = "SampleID dbSNP Min.alle M.zygo Sample1 Sample2 Sample3
311 rs1490413 A Homo G A G
311 rs730123 G Homo A G A
311 rs7532151 A Homo A C C
311 rs1434369 G Homo T G T
311 rs1563172 T Homo T C C", header = TRUE)
df %>%
group_by(SampleID) %>%
summarise(across(starts_with("Sample"), ~mean(.x == Min.alle)))
#> # A tibble: 1 × 4
#> SampleID Sample1 Sample2 Sample3
#> <int> <dbl> <dbl> <dbl>
#> 1 311 0.4 0.6 0
Edit
To print the column names (instead of "Sample_n") you could use:
awk '
BEGIN{
OFS="\t"; printf "%s\t", "SampleID"
}
NR==1{
for(i=5;i<=NF;i ){
printf "%s\t", $i
}
}
NR>1{
sample[$1]
!sampleID[$1]
for(i=5;i<=NF;i ){
if($3 == $i){
count[$1, i]
}
}
}
END{
for (j in sampleID) {
print ""
printf "%s\t", j
for(i=5;i<=NF;i ){
printf "%s\t", count[j, i] / sample[j]
}
}
}' inputfile