I've been tasked with summarising a few files into a TSV file. I have to select specific data from a list of files and write it as a line of tab-separated columns in a TSV file. Every line in the files has a 'name' as its first column, so it is easy to filter data ($1 == "NAME"). One file == one line in the TSV. So far I wrote this:
#! /bin/bash
cat > newFile.txt
for f in *.pdb; do
awk '$1 == "ACCESSION" {print $2}' ORS="/t" "$f" >> newFile.txt
awk '$1 == "DEFINITION" {print $2}' ORS="/t" "$f" >> newFile.txt
awk '$1 == "SOURCE" {print $2}' ORS="/t" "$f" >> newFile.txt
awk '$1 == "LOCUS" {print$4}' ORS="/r" "$f" >> newFile.txt
done
Obviously this atrocity of a script does not work. Is it possible to modify what I wrote and complete the task using awk?
Example of a file:
LOCUS \t NM_123456 \t 2000bp \t mRNA
DEFINITION \t Very nice gene from a very nice mouse
ACCESSION \t NM_123456
VERSION \t 1.000
SOURCE \t Very nice mouse
end result:
NM_123456 /t Very nice gene from a very nice mouse /t Very nice mouse /t mRNA
NM_345678 /t Not so nice gene from an angry elephant /t Angry Elephant /t mRNA
"/t" stands for a tab (I did not know how to write it down sorry). Also the example files contain much more information, I just gave a 'header' let's say.
CodePudding user response:
In plain bash:
for file in *.pdb; do
    acc=
    def=
    src=
    loc=
    while IFS=$'\t' read -ra fields; do
        if [[ ${fields[0]} = "ACCESSION" ]]; then
            acc=${fields[1]}
        elif [[ ${fields[0]} = "DEFINITION" ]]; then
            def=${fields[1]}
        elif [[ ${fields[0]} = "SOURCE" ]]; then
            src=${fields[1]}
        elif [[ ${fields[0]} = "LOCUS" ]]; then
            loc=${fields[3]}
        fi
    done < "$file"
    printf '%s\t%s\t%s\t%s\n' "$acc" "$def" "$src" "$loc" >> newFile.txt
done
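The key move in this approach is `IFS=$'\t' read -ra fields`, which splits each line on tabs only, so the spaces inside a value like the DEFINITION text survive as part of one field. A minimal bash sketch (the sample line is invented for illustration):

```shell
# Split one tab-delimited line into an array; spaces within a field are kept
# because IFS is restricted to a tab for the duration of the read.
line=$'DEFINITION\tVery nice gene from a very nice mouse'
IFS=$'\t' read -ra fields <<< "$line"
echo "${fields[0]}"   # → DEFINITION
echo "${fields[1]}"   # → Very nice gene from a very nice mouse
```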
CodePudding user response:
If those lines appear in the same order in every file, and exactly once per file (no more, no less), you can do this:
awk -F'\t' '
    $1 == "ACCESSION"  {printf "%s\t", $2}
    $1 == "DEFINITION" {printf "%s\t", $2}
    $1 == "SOURCE"     {printf "%s\t", $2}
    $1 == "LOCUS"      {print $4}' *.pdb > table.tsv
However, if the order of lines varies, or some files don't have every line, or some files repeat a line (e.g. a SOURCE line appears twice), you will need something more complex, like this:
awk -F'\t' '
    function print_row(cols) {
        for (i = 0; i < 3; i++) {
            printf "%s\t", cols[i]
            cols[i] = ""
        }
        print cols[3]
        cols[3] = ""
    }
    NR != FNR && FNR == 1 {print_row(cols)}
    $1 == "ACCESSION"  {cols[0] = $2}
    $1 == "DEFINITION" {cols[1] = $2}
    $1 == "SOURCE"     {cols[2] = $2}
    $1 == "LOCUS"      {cols[3] = $4}
    END {print_row(cols)}' *.pdb > table.tsv
It always prints a neat table, with columns lining up correctly, regardless of the order of lines in a file, and even if some lines are missing or occur more than once. If a line occurs more than once, the last occurrence is used.
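As a quick check of that claim, here is a throwaway run with two made-up files — one with the lines shuffled, one missing DEFINITION entirely (file names and contents are invented for the demo):

```shell
# Scratch directory with two toy input files (hypothetical sample data).
cd "$(mktemp -d)"
# Lines deliberately out of order:
printf 'SOURCE\tMouse\nLOCUS\tx\ty\tmRNA\nACCESSION\tNM_1\nDEFINITION\tGene one\n' > one.pdb
# DEFINITION line missing entirely:
printf 'ACCESSION\tNM_2\nLOCUS\ta\tb\trRNA\nSOURCE\tElephant\n' > two.pdb

awk -F'\t' '
    function print_row(cols) {
        for (i = 0; i < 3; i++) {
            printf "%s\t", cols[i]
            cols[i] = ""
        }
        print cols[3]
        cols[3] = ""
    }
    # At the first line of every file after the first, flush the previous file.
    NR != FNR && FNR == 1 {print_row(cols)}
    $1 == "ACCESSION"  {cols[0] = $2}
    $1 == "DEFINITION" {cols[1] = $2}
    $1 == "SOURCE"     {cols[2] = $2}
    $1 == "LOCUS"      {cols[3] = $4}
    END {print_row(cols)}' *.pdb > table.tsv

cat table.tsv
# → NM_1    Gene one    Mouse       mRNA
# → NM_2                Elephant    rRNA   (empty DEFINITION column)
```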
CodePudding user response:
If gawk, which supports the ENDFILE block, is available, please try:
awk -F'\t' -v OFS='\t' '    # assign input/output field separators to a tab character
BEGIN {
    # the array "names" holds the tags in output-column order
    split("ACCESSION,DEFINITION,SOURCE,LOCUS", names, ",")
}
{
    if ($1 == "LOCUS") a[$1] = $4
    else a[$1] = $2
}
ENDFILE {                   # this block is invoked after reading each file
    # print a["ACCESSION"], a["DEFINITION"], ... in order, as one TSV row
    print a[names[1]], a[names[2]], a[names[3]], a[names[4]]
    delete a                # clear array "a" for the next file
}' *.pdb
CodePudding user response:
This is probably what you're looking for, using any awk in any shell on every Unix box (untested):
awk '
BEGIN { FS=OFS="\t" }
{ f[$1] = ($1 == "LOCUS" ? $4 : $2) }
$1 == "SOURCE" {
print f["ACCESSION"], f["DEFINITION"], f["SOURCE"], f["LOCUS"]
}
' *.pdb > newFile.txt
The above assumes every input file has the same tag-value pairs as shown in the input file in your question and that SOURCE is always the last one.
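Untested as noted, but a smoke test is easy to sketch. The two sample files below are invented for the demo, shaped like the header in the question:

```shell
# Scratch directory with two toy .pdb files (hypothetical sample data).
cd "$(mktemp -d)"
printf 'LOCUS\tNM_123456\t2000bp\tmRNA\nDEFINITION\tVery nice gene\nACCESSION\tNM_123456\nSOURCE\tVery nice mouse\n' > a.pdb
printf 'LOCUS\tNM_345678\t1500bp\tmRNA\nDEFINITION\tAngry gene\nACCESSION\tNM_345678\nSOURCE\tAngry elephant\n' > b.pdb

# Collect tag-value pairs per line; SOURCE, being last, triggers the row print.
awk '
BEGIN { FS=OFS="\t" }
{ f[$1] = ($1 == "LOCUS" ? $4 : $2) }
$1 == "SOURCE" {
    print f["ACCESSION"], f["DEFINITION"], f["SOURCE"], f["LOCUS"]
}
' *.pdb > newFile.txt

cat newFile.txt
# → NM_123456    Very nice gene    Very nice mouse    mRNA
# → NM_345678    Angry gene        Angry elephant     mRNA
```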