Output specific fields using bash

I have a test.fasta file with the following data:

>PPP.0124.1.PC lib=RU01 length=410 description=Protein description goes here 1 serine/threonine  
MLEAPKFTGIIGLNNNHDNYDLSQGFYHKLGEGSNMSIDSFGSLQLSNGG
GSVAMSVSSVGSNDSHTRILNHQGLKRVNGNYSVARSVNRGKVSHGLSDD
ALAQ
>PPP.14552.PC lib=RU01 length=104 description=Protein description goes here 2 uncharacterized protein LOC11441
MKSVVGMVVSNKMQKSVVVAVDRLFHHKLYDRYVKRTSKFMAHDEHNLCN
IGDRVRL
>PPP.94014.PC lib=RU01 length=206 description=Protein description goes here 3 some more chemicals and stuff
MDLGPTLTLQKGRQRRGKGPYAGVRSRGGRWVSEIRIPKTKTRIWLGSHH
SPEKAARAYDAALYCLKGEHGSFNFPNNRGPYLANRSVGSLPVDEIQCIA
AEFSCFDDSA

I would like to take the ID and the description and output them into a .tsv file, with the first column being the ID and the second column holding the description.

Desired output:

| ID | Description |
| -------- | -------------- |
| 0124    | Protein description goes here 1 serine/threonine           |
| 14552   | Protein description goes here 2 uncharacterized protein LOC11441            |
| 94014 | Protein description goes here 3 some more chemicals and stuff |

Any ideas on a one-line Bash command to achieve this?

I currently have this:

grep -a '^>' test.fasta |
awk '{print $1}'

which gives me the first field of each header line (containing the ID), but I can't seem to figure out the rest!

CodePudding user response：

You can use the following one-line bash command to extract the IDs and descriptions from your test.fasta file and output them in a tab-separated values (TSV) format:

grep -a '^>' test.fasta | awk '{split($1, a, "."); sub(/.*description=/, ""); print a[2] "\t" $0}'

CodePudding user response：

Using awk:

awk 'BEGIN{print "id\tdescription"} \
/.PC / && !/uncharac/ { \
split($1,b,"."); id=b[2]; \
$1=""; $2=""; $3=""; $(NF)=""; $(NF-1)=""; $(NF-2)=""; \
gsub("description=",""); print id"\t"$0} \
/.PC / && /uncharac/ { \
split($1,b,"."); id=b[2]; \
$1=""; $2=""; $3=""; $(NF)=""; $(NF-1)=""; \
gsub("description=",""); print id"\t"$0}' test.fasta

id  description
0124       Protein description goes here 1 || serine/threonine   
14552      Protein description goes here 2 || uncharacterized protein LOC11441  
94014      Protein description goes here 3 || some more chemicals and stuff

Since the description can span any number of columns, you need to remove the known, unwanted columns around it. In your test data, records seem to fall into two groups: those containing 'uncharacterized protein' and those that do not. Records with 'uncharacterized protein' need only 2 trailing columns removed, while the others need 3 trailing columns removed.

Parse the id from the first column: split($1,b,"."); id=b[2];

Remove unwanted columns: $1=""; $2=""; $3=""; $(NF)=""; $(NF-1)=""; $(NF-2)=""; OR $1=""; $2=""; $3=""; $(NF)=""; $(NF-1)=""; (if uncharacterized protein).

Clean the description by removing 'description=': gsub("description=","");
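
One side effect of blanking fields this way is easier to see on a small case: assigning to any field makes awk rebuild $0 with the output separator, so the emptied fields leave stray blanks behind. A minimal sketch, using a made-up input line rather than the FASTA data, that shows the effect and one way to trim it:

```shell
# Made-up sample record (not from test.fasta): three leading fields we do not want.
line='a b c keep this text'

# Blanking $1..$3 rebuilds $0 joined by single spaces, leaving three leading blanks.
raw=$(printf '%s\n' "$line" | awk '{ $1=""; $2=""; $3=""; print }')

# Same, but strip leading and trailing blanks before printing.
trimmed=$(printf '%s\n' "$line" | awk '{ $1=""; $2=""; $3=""; gsub(/^ +| +$/, ""); print }')

printf '[%s]\n[%s]\n' "$raw" "$trimmed"
```

The brackets in the printed result make the leftover leading blanks visible in the first line and show that they are gone in the second.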

CodePudding user response：

Here's a simple sed script:

sed -n 's/^>[^0-9]*\([0-9][0-9]*\).*description=/\1\t/p' test.fasta

The same could easily be recast into Awk, though it's arguably less elegant.

awk -F . 'BEGIN { OFS="\t" }
    /^>/ { d=$0; sub(/.*description=/, "", d); print $2, d }' test.fasta

which assumes the interesting part of the ID is always between the first and second dots, and avoids the useless grep.
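
Since the question asks for a .tsv file with a header row, the Awk variant could be extended roughly as follows. This is only a sketch: the scratch directory and the output name out.tsv are placeholders I made up, and the sample lines are copied from the question.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)

# Two header lines and one sequence line copied from the question's sample.
cat > "$workdir/test.fasta" <<'EOF'
>PPP.14552.PC lib=RU01 length=104 description=Protein description goes here 2 uncharacterized protein LOC11441
MKSVVGMVVSNKMQKSVVVAVDRLFHHKLYDRYVKRTSKFMAHDEHNLCN
>PPP.94014.PC lib=RU01 length=206 description=Protein description goes here 3 some more chemicals and stuff
EOF

# Header row first, then one ID<TAB>description row per FASTA header line.
# out.tsv is a hypothetical output name.
awk -F . 'BEGIN { OFS="\t"; print "ID", "Description" }
    /^>/ { d=$0; sub(/.*description=/, "", d); print $2, d }' \
    "$workdir/test.fasta" > "$workdir/out.tsv"

cat "$workdir/out.tsv"
```

Sequence lines are skipped by the /^>/ pattern, so only the header row and one row per record end up in the file.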

(The sed variant assumes the first sequence of digits on the line is the ID. It also requires that your sed interprets \t as a literal tab, which isn't entirely portable.)
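
If your sed does not understand \t, one common workaround is to let the shell supply the tab character instead. A sketch, assuming a POSIX shell; the variable names are mine, and the header line is copied from the question:

```shell
# Build a literal tab with printf, which POSIX requires to understand \t.
tab=$(printf '\t')

# One header line copied from the question's sample data.
header='>PPP.94014.PC lib=RU01 length=206 description=Protein description goes here 3 some more chemicals and stuff'

# Close the single quotes around "$tab" so the shell splices in the real tab;
# sed then sees a literal tab in the replacement rather than the \t escape.
result=$(printf '%s\n' "$header" |
    sed -n 's/^>[^0-9]*\([0-9][0-9]*\).*description=/\1'"$tab"'/p')

printf '%s\n' "$result"
```

The regular expression itself is unchanged; only the way the tab reaches sed differs.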

I had to guess some requirements; if my guesses are wrong, please edit your question to clarify exactly how the numeric ID should be extracted, for example.