AWK loop to parse file

I have trouble understanding an awk command which I want to change slightly (but can't, because I don't understand the code well enough). The command combines several text files that each have 6 columns (always the same column order, but potentially different data).

First, I would like to only parse some specific columns from these files and not all 6. I couldn't figure out where to specify it in the awk loop.

Secondly, the column header is no longer the first row of the output file. It would be nice to keep it as the header in the output file as well.

Thirdly, I need to know which file the data comes from. I know that the command takes the files in the order they appear when doing ls -lh *mosdepth.summary.txt, so I can deduce that the first 6 columns are from file 1, the next 6 from file 2, etc. However, I would like to have this information in the output file automatically, to reduce the human errors I could make by inferring the origin of the data.

Here is the awk command

awk -F"\t" -v OFS="\t" 'F!=FILENAME { FNUM++; F=FILENAME }

{       COL[$1]++;        C=$1; $1="";        A[C, FNUM]=$0 }

END {
        for(X in COL)
        {
                printf("%s", X);
                for(N=1; N<=FNUM; N++) printf("%s", A[X, N]);
                printf("\n");
        }
}' *mosdepth.summary.txt > Se_combined.coverage.txt

The input data looks like this: [image]

CodePudding user response：

Awk processes files record by record, where records are separated by the record separator RS. Each record is split into fields, where the field separator is defined by the variable FS, which the -F flag can set.

In the case of the program presented in the OP, the record separator is the default value which is the <newline>-character and the field separator is set to be the <tab>-character.

Awk programs are generally written as a sequence of pattern-action pairs of the form pattern { action }. The pairs are evaluated in order for every record, and an action is performed whenever its pattern evaluates to a non-zero number or a non-empty string.
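As a small illustration of records, fields and pattern-action pairs, the sketch below feeds three tab-separated lines to awk. The sample data and the shell variable `out` are made up for this example.

```shell
# Three tab-separated records; the data is invented for illustration.
out=$(printf 'chrom\tlength\nc1\t100\nc2\t250\n' |
  awk -F'\t' '
    NR == 1  { next }                  # pattern: first record; action: skip the header
    $2 > 200 { print "long:", $1 }     # pattern: numeric test on field 2
             { n++ }                   # no pattern: runs for every remaining record
    END      { print "records:", n }   # runs once, after all input is read
  ')
printf '%s\n' "$out"
```

This prints `long: c2` followed by `records: 2`: the header is skipped, only the second data line passes the test on field 2, and both data lines are counted.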

In the current program there are three such pattern-action pairs:

F!=FILENAME { FNUM++; F=FILENAME }: This states that if the value of F differs from the FILENAME currently being processed, then increase the value of FNUM by one and set F to the current FILENAME.

In the end, this is the same as just checking if we are processing a new file or not. The equivalent version of this would be:
```
(FNR==1) { FNUM++ }
```
which reads: If we are processing the first record of the current file (FNR), then increase the file count FNUM.
{ COL[$1]++; C=$1; $1=""; A[C, FNUM]=$0 }: As there is no pattern, it is true by default. So here, for each record/line, increment the number of times you saw the value in the first column and store it in an associative array COL (key-value pairs). Memorize the first field in C and store in an array A the value of the current record, but with the first field emptied. So if a record of the second file reads "foo A B C D" (tab-separated) and foo has already been seen 3 times, then COL["foo"] will be equal to 4 and A["foo",2] will hold " A B C D", with a leading separator left where the first field used to be.

END{ ... }: This is a special pattern-action pair. Here END indicates that this action should only be executed at the end, when all files have been processed. What the END block does is straightforward: for every key in COL, it prints the key followed by the stored remainder of the record from each file, in file order. A key that is missing from a file contributes an empty string.
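To see the three pairs working together, here is a minimal sketch that runs the program (with the FNR==1 form of the file counter) on two tiny files. The file names and contents are invented, and the output is piped through sort only because the order of `for (X in COL)` is unspecified.

```shell
# Two tiny tab-separated files; names and contents are invented for illustration.
dir=$(mktemp -d)
printf 'chrom\tlength\nc1\t100\nc2\t200\n' > "$dir/a.txt"
printf 'chrom\tlength\nc2\t250\nc3\t300\n' > "$dir/b.txt"

combined=$(awk -F'\t' -v OFS='\t' '
  FNR == 1 { FNUM++ }                                # new file: bump the file counter
  { COL[$1]++; C = $1; $1 = ""; A[C, FNUM] = $0 }    # store the record minus field 1
  END {
    for (X in COL) {                                 # iteration order is unspecified
      printf("%s", X)
      for (N = 1; N <= FNUM; N++) printf("%s", A[X, N])
      printf("\n")
    }
  }' "$dir/a.txt" "$dir/b.txt" | LC_ALL=C sort)
printf '%s\n' "$combined"
rm -r "$dir"
```

With this data the sorted result has the rows c1, c2, c3 and finally the chrom header line. Only c2 has values from both files. The value for c3 comes from the second file but lands directly after the key, because a missing entry prints nothing at all. This is why the header is not guaranteed to come first and why the origin of a column cannot be read off its position.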

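In the end, the script can be written more simply along the following lines. Note that this version prints one full record per line for each key and file rather than joining them side by side, and that the order of a for (... in ...) loop is unspecified: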

awk 'BEGIN{ FS="\t" }
     { file_list[FILENAME]
       key_list[$1]
       record_list[FILENAME,$1]=$0 }
     END { for (key in key_list)
             for (fname in file_list) 
                print ( record_list[fname,key] ? record_list[fname,key] : key )
     }' file1 file2 file3 ...
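If the side-by-side layout of the original command should be kept, the three requests from the question (selected columns, header first, origin of each column) could be handled roughly as follows. This is a sketch under assumptions, not a tested solution for real mosdepth output: the file names, the sample data, the variable `want` and the `NA` filler for missing keys are all made up for illustration, and the first line of every file is assumed to be a header.

```shell
# Sketch only: file names, contents, the variable "want" and the "NA" filler are assumptions.
dir=$(mktemp -d)
printf 'chrom\tlength\tmean\nc1\t100\t5.2\nc2\t200\t7.1\n' > "$dir/a.txt"
printf 'chrom\tlength\tmean\nc2\t250\t8.4\nc3\t300\t9.9\n' > "$dir/b.txt"

wide=$(awk -F'\t' -v OFS='\t' -v want='2 3' '
  BEGIN { n = split(want, cols, " ") }           # columns to keep besides the key
  FNR == 1 {                                     # header line of each file
    nfiles++
    fname[nfiles] = FILENAME
    sub(/.*\//, "", fname[nfiles])               # keep only the base name
    if (nfiles == 1) keyname = $1
    for (i = 1; i <= n; i++) head[nfiles, i] = $cols[i]
    next
  }
  !($1 in seen) { seen[$1] = 1; keys[++nkeys] = $1 }   # remember first-seen key order
  { for (i = 1; i <= n; i++) val[$1, nfiles, i] = $cols[i] }
  END {
    printf "%s", keyname                         # header row comes first
    for (f = 1; f <= nfiles; f++)
      for (i = 1; i <= n; i++) printf "%s%s:%s", OFS, fname[f], head[f, i]
    print ""
    for (k = 1; k <= nkeys; k++) {
      printf "%s", keys[k]
      for (f = 1; f <= nfiles; f++)
        for (i = 1; i <= n; i++) {
          v = ((keys[k], f, i) in val) ? val[keys[k], f, i] : "NA"
          printf "%s%s", OFS, v
        }
      print ""
    }
  }' "$dir/a.txt" "$dir/b.txt")
printf '%s\n' "$wide"
rm -r "$dir"
```

Each output column is labelled file:column (for example a.txt:mean), rows appear in the order the keys were first seen, and a key that is absent from a file gets a filler value so that later columns stay aligned.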

CodePudding user response：

Assuming your '*mosdepth.summary.txt' files look like the following:

$ ls *mos*txt
1mosdepth.summary.txt 2mosdepth.summary.txt 3mosdepth.summary.txt

And contents are:

$ cat 1mosdepth.summary.txt
chrom   length  bases   mean    min max
contig_1_pilon  223468  1181176 5.29    0   860
contig_2_pilon  197061  2556215 12.97   0   217
contig_6_pilon  162902  2132156 13.09   0   80


$ cat 2mosdepth.summary.txt
chrom   length  bases   mean    min max
contig_19_pilon 286502  2067244 7.22    0   345
contig_29_pilon 263348  2222566 8.44    0   765
contig_32_pilon 291449  2671881 9.17    0   128
contig_34_pilon 51310   525393  10.24   0   47

$ cat 3mosdepth.summary.txt
chrom   length  bases   mean    min max
contig_37_pilon 548146  6652322 12.14   0   558
contig_41_pilon 7529    144989  19.26   0   71

The following awk command might be appropriate:

$ awk -v target_cols="1 2 3 4 5 6" 'BEGIN{split(target_cols, cols," ")} \
 NR==1{printf "%s ", "file#"; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""} \
FNR==1{fnbr++} \
FNR>=2{printf "%s ", fnbr; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""}' *mos*txt | column -t

Output:

file#  chrom            length  bases    mean   min  max
1      contig_1_pilon   223468  1181176  5.29   0    860
1      contig_2_pilon   197061  2556215  12.97  0    217
1      contig_6_pilon   162902  2132156  13.09  0    80
2      contig_19_pilon  286502  2067244  7.22   0    345
2      contig_29_pilon  263348  2222566  8.44   0    765
2      contig_32_pilon  291449  2671881  9.17   0    128
2      contig_34_pilon  51310   525393   10.24  0    47
3      contig_37_pilon  548146  6652322  12.14  0    558
3      contig_41_pilon  7529    144989   19.26  0    71

Alternatively, the following will output the filename rather than file#:

$ awk -v target_cols="1 2 3 4 5 6" 'BEGIN{split(target_cols, cols," ")} \
 NR==1{printf "%s ", "fname"; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""} \
FNR==1{fnbr=FILENAME} \
FNR>=2{printf "%s ", fnbr; fnbr="-"; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""}' *mos*txt | column -t

Output:

fname                  chrom            length  bases    mean   min  max
1mosdepth.summary.txt  contig_1_pilon   223468  1181176  5.29   0    860
-                      contig_2_pilon   197061  2556215  12.97  0    217
-                      contig_6_pilon   162902  2132156  13.09  0    80
2mosdepth.summary.txt  contig_19_pilon  286502  2067244  7.22   0    345
-                      contig_29_pilon  263348  2222566  8.44   0    765
-                      contig_32_pilon  291449  2671881  9.17   0    128
-                      contig_34_pilon  51310   525393   10.24  0    47
3mosdepth.summary.txt  contig_37_pilon  548146  6652322  12.14  0    558
-                      contig_41_pilon  7529    144989   19.26  0    71

With either command, target_cols="1 2 3 4 5 6" specifies which columns to extract.

target_cols="1 2 3" for example, will produce:

fname                  chrom            length  bases
1mosdepth.summary.txt  contig_1_pilon   223468  1181176
-                      contig_2_pilon   197061  2556215
-                      contig_6_pilon   162902  2132156
2mosdepth.summary.txt  contig_19_pilon  286502  2067244
-                      contig_29_pilon  263348  2222566
-                      contig_32_pilon  291449  2671881
-                      contig_34_pilon  51310   525393
3mosdepth.summary.txt  contig_37_pilon  548146  6652322
-                      contig_41_pilon  7529    144989

target_cols="4 5 6" will produce:

fname                  mean   min  max
1mosdepth.summary.txt  5.29   0    860
-                      12.97  0    217
-                      13.09  0    80
2mosdepth.summary.txt  7.22   0    345
-                      8.44   0    765
-                      9.17   0    128
-                      10.24  0    47
3mosdepth.summary.txt  12.14  0    558
-                      19.26  0    71
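A possible variant of the same idea is sketched below. It makes three changes that are assumptions about what may be wanted, not part of the answer above. It uses the return value of split() instead of length(cols), since length() on an array is not available in every awk. It splits and joins on tabs, so the output stays tab-separated without column -t. It also writes the file name on every row. The two sample files are invented for illustration.

```shell
# Sketch only: the sample files and their contents are invented for illustration.
dir=$(mktemp -d)
printf 'chrom\tlength\tbases\nc1\t100\t500\n' > "$dir/1mosdepth.summary.txt"
printf 'chrom\tlength\tbases\nc9\t300\t900\n' > "$dir/2mosdepth.summary.txt"

long=$(cd "$dir" && awk -F'\t' -v OFS='\t' -v target_cols='1 3' '
  BEGIN { n = split(target_cols, cols, " ") }      # n = number of requested columns
  FNR == 1 && NR > 1 { next }                       # skip the headers of later files
  {
    printf "%s", (NR == 1 ? "fname" : FILENAME)     # label column: title or file name
    for (i = 1; i <= n; i++) printf "%s%s", OFS, $cols[i]
    print ""
  }' *mosdepth.summary.txt)
printf '%s\n' "$long"
rm -r "$dir"
```

With target_cols='1 3' the result has a single header row (fname, chrom, bases) followed by one row per data line, each starting with the name of the file it came from.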