I have trouble understanding an awk command that I want to change slightly (but can't, because I don't understand the code well enough). The command combines text files that each have 6 columns (always the same column order, but potentially different data).
First, I would like to extract only some specific columns from these files, not all 6. I couldn't figure out where to specify this in the awk loop.
Secondly, the column headers are no longer the first row of the output file. It would be nice to have them as the header in the output file as well.
Thirdly, I need to know which file the data comes from. I know that the command takes the files in the order they appear when doing ls -lh *mosdepth.summary.txt, so I can deduce that the first 6 columns are from file 1, the next 6 from file 2, etc. However, I would like to have this information in the output file automatically, to reduce the potential for human error when inferring the origin of the data.
Here is the awk command:
awk -F"\t" -v OFS="\t" 'F!=FILENAME { FNUM++; F=FILENAME }
{ COL[$1]++; C=$1; $1=""; A[C, FNUM]=$0 }
END {
    for(X in COL)
    {
        printf("%s", X);
        for(N=1; N<=FNUM; N++) printf("%s", A[X, N]);
        printf("\n");
    }
}' *mosdepth.summary.txt > Se_combined.coverage.txt
The input data looks like this: (image not included)
CodePudding user response:
Awk processes files in records, where the records are separated by the record separator RS. Each record is split into fields, where the field separator is defined by the variable FS, which the -F flag can set.
In the case of the program presented in the OP, the record separator is the default value which is the <newline>-character and the field separator is set to be the <tab>-character.
Awk programs are generally written as a sequence of pattern-action pairs of the form pattern { action }. These pairs are executed sequentially, and action is performed when pattern evaluates to a non-zero number or a non-empty string.
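To make the pattern { action } form concrete, here is a minimal sketch. The sample data and the file name demo.tsv are made up for illustration; only -F, NR and NF are standard awk.

```shell
# Hypothetical sample data: a header line plus one tab-separated record.
tmp=$(mktemp -d)
printf 'chrom\tlength\ncontig_1\t100\n' > "$tmp/demo.tsv"

# First pair: the pattern NR > 1 skips the header, the action prints field 2.
# Second pair: no pattern, so the action runs for every record.
out=$(awk -F'\t' '
    NR > 1 { print "length=" $2 }
           { print "record " NR " has " NF " fields" }
' "$tmp/demo.tsv")
echo "$out"
rm -rf "$tmp"
```

Both pairs are tried against each record in the order they are written, which is why the "length=" line appears before the "record 2" line.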
In the current program there are three such pattern-action pairs:
F!=FILENAME { FNUM++; F=FILENAME }
This states that if the value of F is different from the FILENAME currently being processed, then increase the value of FNUM by one and set F to the current FILENAME. In the end, this is the same as checking whether we have started processing a new file. An equivalent version would be:
(FNR==1) { FNUM++ }
which reads: if we are processing the first record of the current file (FNR is the record number within the current file), then increase the file count FNUM.
{ COL[$1]++; C=$1; $1=""; A[C, FNUM]=$0 }
As there is no pattern, it is true by default. So here, for each record/line, increment the number of times you saw the value of the first column and store it in an associative array COL (key-value pairs). Memorize the first field in C and store in an array A the value of the current record with the first field blanked out. So if a record of the second file reads "foo A B C D", and foo has already been seen 3 times, then COL["foo"] will be equal to 4 and A["foo",2] will read " A B C D" (the leading separator is what remains of the emptied first field).
END { ... }
This is a special pattern-action pair. Here END indicates that this action should only be executed at the end, when all files have been processed. What the END block does is straightforward: for every key in COL it prints the key followed by what was stored for that key from each file, including empty entries for files that do not contain the key. Since for (X in COL) visits the keys in an unspecified order, the header line is not guaranteed to come out first.
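The two file-counting idioms discussed above can be compared side by side. This is a small sketch on made-up files a.txt and b.txt; the counter names by_name and by_fnr are hypothetical.

```shell
# Hypothetical input: two small tab-separated files.
tmp=$(mktemp -d)
printf 'foo\tA\n' > "$tmp/a.txt"
printf 'foo\tB\nbar\tC\n' > "$tmp/b.txt"

out=$(awk -F'\t' '
    F != FILENAME { by_name++; F = FILENAME }   # idiom used in the question
    FNR == 1      { by_fnr++ }                  # shorter idiom
    END           { print by_name, by_fnr }     # both counters, space-separated
' "$tmp/a.txt" "$tmp/b.txt")
echo "$out"
rm -rf "$tmp"
```

The two counters agree for ordinary input. They would differ if the same file were listed twice in a row on the command line, because FILENAME does not change between the two passes while FNR restarts at 1.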
In the end, this entire script can be simplified to the following. Note that, unlike the original, this version prints each stored record on its own line rather than joining the records side by side:
awk 'BEGIN{ FS="\t" }
     { file_list[FILENAME]
       key_list[$1]
       record_list[FILENAME,$1]=$0 }
     END { for (key in key_list)
             for (fname in file_list)
               print ( record_list[fname,key] ? record_list[fname,key] : key )
     }' file1 file2 file3 ...
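Because for (key in array) visits keys in an unspecified order, the scripts above cannot guarantee that the header row comes first. One possible way around this, sketched here under assumptions, is to remember the order in which keys are first seen and to store only the wanted columns. The variable names (want, cols, cell, order, seen, fname) and the sample files are hypothetical, and the sketch assumes column 1 is the key and that every file starts with the same header line.

```shell
# Hypothetical sample data: two small tab-separated files with a header.
tmp=$(mktemp -d)
printf 'chrom\tlength\tmean\nc1\t10\t1.5\nc2\t20\t2.5\n' > "$tmp/a.txt"
printf 'chrom\tlength\tmean\nc1\t11\t3.5\n' > "$tmp/b.txt"

out=$(cd "$tmp" && awk -F'\t' -v OFS='\t' -v want="2 3" '
    BEGIN         { n = split(want, cols, " ") }           # columns to keep
    FNR == 1      { fname[++nfiles] = FILENAME }           # one entry per file
    !($1 in seen) { seen[$1]; order[++nkeys] = $1 }        # first-seen key order
                  { for (i = 1; i <= n; i++) cell[$1, nfiles, i] = $(cols[i]) }
    END {
        # Extra first row: which file each output column came from.
        line = "file"
        for (f = 1; f <= nfiles; f++)
            for (i = 1; i <= n; i++) line = line OFS fname[f]
        print line
        # Keys in first-seen order, so the header key is printed before the data.
        for (k = 1; k <= nkeys; k++) {
            line = order[k]
            for (f = 1; f <= nfiles; f++)
                for (i = 1; i <= n; i++) line = line OFS cell[order[k], f, i]
            print line
        }
    }
' a.txt b.txt)
echo "$out"
rm -rf "$tmp"
```

Set want to the column numbers you need. A key that is missing from a file yields empty cells for that file, so every row keeps the same number of columns.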
CodePudding user response:
Assuming your '*mosdepth.summary.txt' files look like the following:
$ ls *mos*txt
1mosdepth.summary.txt 2mosdepth.summary.txt 3mosdepth.summary.txt
And contents are:
$ cat 1mosdepth.summary.txt
chrom length bases mean min max
contig_1_pilon 223468 1181176 5.29 0 860
contig_2_pilon 197061 2556215 12.97 0 217
contig_6_pilon 162902 2132156 13.09 0 80
$ cat 2mosdepth.summary.txt
chrom length bases mean min max
contig_19_pilon 286502 2067244 7.22 0 345
contig_29_pilon 263348 2222566 8.44 0 765
contig_32_pilon 291449 2671881 9.17 0 128
contig_34_pilon 51310 525393 10.24 0 47
$ cat 3mosdepth.summary.txt
chrom length bases mean min max
contig_37_pilon 548146 6652322 12.14 0 558
contig_41_pilon 7529 144989 19.26 0 71
The following awk command might be appropriate:
$ awk -v target_cols="1 2 3 4 5 6" 'BEGIN{split(target_cols, cols," ")} \
NR==1{printf "%s ", "file#"; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""} \
FNR==1{fnbr++} \
FNR>=2{printf "%s ", fnbr; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""}' *mos*txt | column -t
Output:
file# chrom length bases mean min max
1 contig_1_pilon 223468 1181176 5.29 0 860
1 contig_2_pilon 197061 2556215 12.97 0 217
1 contig_6_pilon 162902 2132156 13.09 0 80
2 contig_19_pilon 286502 2067244 7.22 0 345
2 contig_29_pilon 263348 2222566 8.44 0 765
2 contig_32_pilon 291449 2671881 9.17 0 128
2 contig_34_pilon 51310 525393 10.24 0 47
3 contig_37_pilon 548146 6652322 12.14 0 558
3 contig_41_pilon 7529 144989 19.26 0 71
Alternatively, the following will output the filename rather than file#:
$ awk -v target_cols="1 2 3 4 5 6" 'BEGIN{split(target_cols, cols," ")} \
NR==1{printf "%s ", "fname"; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""} \
FNR==1{fnbr=FILENAME} \
FNR>=2{printf "%s ", fnbr; fnbr="-"; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""}' *mos*txt | column -t
Output:
fname chrom length bases mean min max
1mosdepth.summary.txt contig_1_pilon 223468 1181176 5.29 0 860
- contig_2_pilon 197061 2556215 12.97 0 217
- contig_6_pilon 162902 2132156 13.09 0 80
2mosdepth.summary.txt contig_19_pilon 286502 2067244 7.22 0 345
- contig_29_pilon 263348 2222566 8.44 0 765
- contig_32_pilon 291449 2671881 9.17 0 128
- contig_34_pilon 51310 525393 10.24 0 47
3mosdepth.summary.txt contig_37_pilon 548146 6652322 12.14 0 558
- contig_41_pilon 7529 144989 19.26 0 71
With either command, target_cols="1 2 3 4 5 6" specifies the columns to extract. For example, target_cols="1 2 3" will produce:
fname chrom length bases
1mosdepth.summary.txt contig_1_pilon 223468 1181176
- contig_2_pilon 197061 2556215
- contig_6_pilon 162902 2132156
2mosdepth.summary.txt contig_19_pilon 286502 2067244
- contig_29_pilon 263348 2222566
- contig_32_pilon 291449 2671881
- contig_34_pilon 51310 525393
3mosdepth.summary.txt contig_37_pilon 548146 6652322
- contig_41_pilon 7529 144989
Similarly, target_cols="4 5 6" will produce:
fname mean min max
1mosdepth.summary.txt 5.29 0 860
- 12.97 0 217
- 13.09 0 80
2mosdepth.summary.txt 7.22 0 345
- 8.44 0 765
- 9.17 0 128
- 10.24 0 47
3mosdepth.summary.txt 12.14 0 558
- 19.26 0 71
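If the combined file is going to be sorted or filtered later, it may help to keep the file name on every row and to emit tab-separated output instead of aligning with column -t. The following is only a sketch of that variation on made-up files a.txt and b.txt; it reuses the target_cols idea from above, and the variable names n and line are hypothetical.

```shell
# Hypothetical sample data: two small tab-separated files with a header.
tmp=$(mktemp -d)
printf 'chrom\tlength\tmean\nc1\t10\t1.5\n' > "$tmp/a.txt"
printf 'chrom\tlength\tmean\nc2\t20\t2.5\n' > "$tmp/b.txt"

out=$(cd "$tmp" && awk -F'\t' -v OFS='\t' -v target_cols="1 3" '
    BEGIN { n = split(target_cols, cols, " ") }
    {
        if (FNR == 1) {
            if (NR != 1) next        # skip repeated headers of later files
            line = "fname"           # label for the extra first column
        } else {
            line = FILENAME          # every data row carries its file name
        }
        for (i = 1; i <= n; i++) line = line OFS $(cols[i])
        print line
    }
' a.txt b.txt)
echo "$out"
rm -rf "$tmp"
```

With the sample files this prints one header row followed by one row per data line, each starting with a.txt or b.txt.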