I've been tasked with summarising a few files into a TSV file. I have to select specific data from a list of files and write it as a line of tab-separated columns in a TSV file. Every line in the files has a 'name' as its first column, so it is easy to filter data ($1 == "NAME"). One file == one line in the TSV. So far I wrote this:
#! /bin/bash
cat > newFile.txt
for f in *.pdb; do
awk '$1 == "ACCESSION" {print $2}' ORS="/t" "$f" >> newFile.txt
awk '$1 == "DEFINITION" {print $2}' ORS="/t" "$f" >> newFile.txt
awk '$1 == "SOURCE" {print $2}' ORS="/t" "$f" >> newFile.txt
awk '$1 == "LOCUS" {print$4}' ORS="/r" "$f" >> newFile.txt
done
Obviously this atrocity of a script does not work. Is it possible to modify what I wrote and complete the task using awk?
Example of a file:
LOCUS \t NM_123456 \t 2000bp \t mRNA
DEFINITION \t Very nice gene from a very nice mouse
ACCESSION \t NM_123456
VERSION \t 1.000
SOURCE \t Very nice mouse
end result:
NM_123456 /t Very nice gene from a very nice mouse /t Very nice mouse /t mRNA
NM_345678 /t Not so nice gene from an angry elephant /t Angry Elephant /t mRNA
"/t" stands for a tab (I did not know how to write it down sorry). Also the example files contain much more information, I just gave a 'header' let's say.
CodePudding user response:
In plain bash:
for file in *.pdb; do
    acc=
    def=
    src=
    loc=
    while IFS=$'\t' read -ra fields; do
        if [[ ${fields[0]} = "ACCESSION" ]]; then
            acc=${fields[1]}
        elif [[ ${fields[0]} = "DEFINITION" ]]; then
            def=${fields[1]}
        elif [[ ${fields[0]} = "SOURCE" ]]; then
            src=${fields[1]}
        elif [[ ${fields[0]} = "LOCUS" ]]; then
            loc=${fields[3]}
        fi
    done < "$file"
    printf '%s\t%s\t%s\t%s\n' "$acc" "$def" "$src" "$loc" >> newFile.txt
done
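The key move in this approach is `IFS=$'\t' read -ra fields`, which splits each line on tabs only, so the spaces inside a value like the DEFINITION text survive as part of one field. A minimal bash sketch (the sample line is invented for illustration):

```shell
# Split one tab-delimited line into an array; spaces within a field are kept
# because IFS is restricted to a tab for the duration of the read.
line=$'DEFINITION\tVery nice gene from a very nice mouse'
IFS=$'\t' read -ra fields <<< "$line"
echo "${fields[0]}"   # → DEFINITION
echo "${fields[1]}"   # → Very nice gene from a very nice mouse
```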
CodePudding user response:
If those lines appear in the same order in every file, and exactly once per file (no more, no less), you can do this:
awk -F'\t' '
    $1 == "ACCESSION"  {printf "%s\t", $2}
    $1 == "DEFINITION" {printf "%s\t", $2}
    $1 == "SOURCE"     {printf "%s\t", $2}
    $1 == "LOCUS"      {print $4}' *.pdb > table.tsv
However, if the order of lines varies, or some files don't have every line, or some files repeat a line (e.g. a SOURCE line appears twice), you will need something more complex, like this:
awk -F'\t' '
    function print_row(cols) {
        for (i = 0; i < 3; i++) {
            printf "%s\t", cols[i]
            cols[i] = ""
        }
        print cols[3]
        cols[3] = ""
    }
    NR != FNR && FNR == 1 {print_row(cols)}
    $1 == "ACCESSION"  {cols[0] = $2}
    $1 == "DEFINITION" {cols[1] = $2}
    $1 == "SOURCE"     {cols[2] = $2}
    $1 == "LOCUS"      {cols[3] = $4}
    END {print_row(cols)}' *.pdb > table.tsv
It always prints a neat table, with columns lining up correctly, regardless of the order of lines in a file, and even if some lines are missing or occur more than once. If a line occurs more than once, the last occurrence is used.
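As a quick check of that claim, here is a throwaway run with two made-up files — one with the lines shuffled, one missing DEFINITION entirely (file names and contents are invented for the demo):

```shell
# Scratch directory with two toy input files (hypothetical sample data).
cd "$(mktemp -d)"
# Lines deliberately out of order:
printf 'SOURCE\tMouse\nLOCUS\tx\ty\tmRNA\nACCESSION\tNM_1\nDEFINITION\tGene one\n' > one.pdb
# DEFINITION line missing entirely:
printf 'ACCESSION\tNM_2\nLOCUS\ta\tb\trRNA\nSOURCE\tElephant\n' > two.pdb

awk -F'\t' '
    function print_row(cols) {
        for (i = 0; i < 3; i++) {
            printf "%s\t", cols[i]
            cols[i] = ""
        }
        print cols[3]
        cols[3] = ""
    }
    # At the first line of every file after the first, flush the previous file.
    NR != FNR && FNR == 1 {print_row(cols)}
    $1 == "ACCESSION"  {cols[0] = $2}
    $1 == "DEFINITION" {cols[1] = $2}
    $1 == "SOURCE"     {cols[2] = $2}
    $1 == "LOCUS"      {cols[3] = $4}
    END {print_row(cols)}' *.pdb > table.tsv

cat table.tsv
# → NM_1    Gene one    Mouse       mRNA
# → NM_2                Elephant    rRNA   (empty DEFINITION column)
```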
CodePudding user response:
If gawk, which supports the ENDFILE block, is available, please try:
awk -F'\t' -v OFS='\t' '    # assign input/output field separators to a tab character
BEGIN {
    # the array "names" holds the tags in output-column order
    split("ACCESSION,DEFINITION,SOURCE,LOCUS", names, ",")
}
{
    if ($1 == "LOCUS") a[$1] = $4
    else a[$1] = $2
}
ENDFILE {                   # this block is invoked after reading each file
    # print a["ACCESSION"], a["DEFINITION"], ... in order, as one TSV row
    print a[names[1]], a[names[2]], a[names[3]], a[names[4]]
    delete a                # clear array "a" for the next file
}' *.pdb
CodePudding user response:
This is probably what you're looking for, using any awk in any shell on every Unix box (untested):
awk '
BEGIN { FS=OFS="\t" }
{ f[$1] = ($1 == "LOCUS" ? $4 : $2) }
$1 == "SOURCE" {
print f["ACCESSION"], f["DEFINITION"], f["SOURCE"], f["LOCUS"]
}
' *.pdb > newFile.txt
The above assumes every input file has the same tag-value pairs as shown in the input file in your question and that SOURCE is always the last one.
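Untested as noted, but a smoke test is easy to sketch. The two sample files below are invented for the demo, shaped like the header in the question:

```shell
# Scratch directory with two toy .pdb files (hypothetical sample data).
cd "$(mktemp -d)"
printf 'LOCUS\tNM_123456\t2000bp\tmRNA\nDEFINITION\tVery nice gene\nACCESSION\tNM_123456\nSOURCE\tVery nice mouse\n' > a.pdb
printf 'LOCUS\tNM_345678\t1500bp\tmRNA\nDEFINITION\tAngry gene\nACCESSION\tNM_345678\nSOURCE\tAngry elephant\n' > b.pdb

# Collect tag-value pairs per line; SOURCE, being last, triggers the row print.
awk '
BEGIN { FS=OFS="\t" }
{ f[$1] = ($1 == "LOCUS" ? $4 : $2) }
$1 == "SOURCE" {
    print f["ACCESSION"], f["DEFINITION"], f["SOURCE"], f["LOCUS"]
}
' *.pdb > newFile.txt

cat newFile.txt
# → NM_123456    Very nice gene    Very nice mouse    mRNA
# → NM_345678    Angry gene        Angry elephant     mRNA
```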