I have a test.fasta file with the following data:
>PPP.0124.1.PC lib=RU01 length=410 description=Protein description goes here 1 serine/threonine
MLEAPKFTGIIGLNNNHDNYDLSQGFYHKLGEGSNMSIDSFGSLQLSNGG
GSVAMSVSSVGSNDSHTRILNHQGLKRVNGNYSVARSVNRGKVSHGLSDD
ALAQ
>PPP.14552.PC lib=RU01 length=104 description=Protein description goes here 2 uncharacterized protein LOC11441
MKSVVGMVVSNKMQKSVVVAVDRLFHHKLYDRYVKRTSKFMAHDEHNLCN
IGDRVRL
>PPP.94014.PC lib=RU01 length=206 description=Protein description goes here 3 some more chemicals and stuff
MDLGPTLTLQKGRQRRGKGPYAGVRSRGGRWVSEIRIPKTKTRIWLGSHH
SPEKAARAYDAALYCLKGEHGSFNFPNNRGPYLANRSVGSLPVDEIQCIA
AEFSCFDDSA
I would like to take the ID and the description and output them into a .tsv
file, with the first column being the ID and the second column holding the description.
Desired output:
| ID | Description |
| -------- | -------------- |
| 0124 | Protein description goes here 1 serine/threonine |
| 14552 | Protein description goes here 2 uncharacterized protein LOC11441 |
| 94014 | Protein description goes here 3 some more chemicals and stuff |
Any ideas on a one-line Bash command to achieve this?
I currently have this:
grep -a '^>' test.fasta |
awk '{print $1}'
which gives me the header lines and the IDs, but I can't seem to figure out the rest!
CodePudding user response:
You can use the following one-line bash command to extract the IDs and descriptions from your test.fasta file and output them in a tab-separated values (TSV) format:
grep -a '^>' test.fasta | awk '{split($1, p, "."); sub(/.*description=/, ""); print p[2] "\t" $0}'
CodePudding user response:
Using awk:
awk 'BEGIN{print "id\tdescription"} \
/.PC / && !/uncharac/ { \
split($1,b,"."); id=b[2]; \
$1=""; $2=""; $3=""; $(NF)=""; $(NF-1)=""; $(NF-2)=""; \
gsub("description=",""); print id"\t"$0} \
/.PC / && /uncharac/ { \
split($1,b,"."); id=b[2]; \
$1=""; $2=""; $3=""; $(NF)=""; $(NF-1)=""; \
gsub("description=",""); print id"\t"$0}' test.fasta
Output:
id description
0124 Protein description goes here 1 || serine/threonine
14552 Protein description goes here 2 || uncharacterized protein LOC11441
94014 UProtein description goes here 3 || some more chemicals and stuff
Since the description can span any number of columns, you need to remove the known, unwanted columns. In your test data, records seem to fall into two groups: those containing 'uncharacterized protein' and those that do not. Records containing 'uncharacterized protein' need only 2 trailing columns removed, while the other records need 3 trailing columns removed.
Parse the id from the first column: split($1,b,"."); id=b[2];
Remove unwanted columns: $1=""; $2=""; $3=""; $(NF)=""; $(NF-1)=""; $(NF-2)="";
OR, for 'uncharacterized protein' records: $1=""; $2=""; $3=""; $(NF)=""; $(NF-1)="";
Clean the description by removing 'description=': gsub("description=","");
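As a minimal sketch of those steps on a single header line from the question: it assumes nothing follows the description (as in the data shown in the question), so only the three leading columns are blanked. Assigning to a field makes awk rebuild the record with single spaces between fields, so the blanked columns leave separator spaces behind; the sketch trims them before printing. The variable names line and result are illustrative only.

```shell
# One header line copied from the question's test.fasta.
line='>PPP.0124.1.PC lib=RU01 length=410 description=Protein description goes here 1 serine/threonine'

result=$(printf '%s\n' "$line" | awk '{
    split($1, b, "."); id = b[2]      # ">PPP.0124.1.PC" -> b[2] is "0124"
    $1 = ""; $2 = ""; $3 = ""         # blank the three leading columns
    sub(/^ +/, "")                    # trim the separators the blanked columns leave behind
    sub(/^description=/, "")          # strip the label from the start of the description
    print id "\t" $0
}')

printf '%s\n' "$result"
```

If your real headers do carry extra trailing columns, they would still need to be blanked as described above.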
CodePudding user response:
Here's a simple sed script:
sed -n 's/^>[^0-9]*\([0-9][0-9]*\).*description=/\1\t/p' test.fasta
The same could easily be recast into Awk, though it's arguably less elegant.
awk -F . 'BEGIN { OFS="\t" }
/^>/ { d=$0; sub(/.*description=/, "", d); print $2, d }' test.fasta
This assumes the interesting part of the ID is always between the first and second dots, and avoids the useless cat.
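Since the question asks for a .tsv file with a header row, here is a rough sketch of how the Awk variant might be extended. The output name ids.tsv, the scratch directory, and the shortened two-record sample are assumptions for illustration, not part of the original question.

```shell
# Scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)

# Two records copied from the question (sequence lines shortened).
cat > "$workdir/test.fasta" <<'EOF'
>PPP.0124.1.PC lib=RU01 length=410 description=Protein description goes here 1 serine/threonine
MLEAPKFTGIIGLNNNHDNYDLSQGFYHKLGEGSNMSIDSFGSLQLSNGG
>PPP.14552.PC lib=RU01 length=104 description=Protein description goes here 2 uncharacterized protein LOC11441
MKSVVGMVVSNKMQKSVVVAVDRLFHHKLYDRYVKRTSKFMAHDEHNLCN
EOF

# Print the header row once, then one ID/description row per FASTA header line.
awk -F . 'BEGIN { OFS = "\t"; print "ID", "Description" }
/^>/ { d = $0; sub(/.*description=/, "", d); print $2, d }' \
    "$workdir/test.fasta" > "$workdir/ids.tsv"

cat "$workdir/ids.tsv"
```

The scratch directory is left in place so the result can be inspected; remove it when done.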
(The sed variant assumes the first sequence of digits on the line is the ID. It also requires that your sed interprets \t as a literal tab, which isn't entirely portable.)
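One possible way around the \t caveat is to let the shell produce the tab with printf and splice it into the sed script, so sed only ever sees a literal tab character. This is a sketch under the same assumptions as the sed variant; the scratch file and the variable names fasta, tab and result are illustrative.

```shell
# Two records copied from the question (sequence lines shortened), in a scratch file.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>PPP.0124.1.PC lib=RU01 length=410 description=Protein description goes here 1 serine/threonine
MLEAPKFTGIIGLNNNHDNYDLSQGFYHKLGEGSNMSIDSFGSLQLSNGG
>PPP.14552.PC lib=RU01 length=104 description=Protein description goes here 2 uncharacterized protein LOC11441
MKSVVGMVVSNKMQKSVVVAVDRLFHHKLYDRYVKRTSKFMAHDEHNLCN
EOF

# Build a literal tab in the shell so sed never has to interpret \t.
tab=$(printf '\t')

# Same substitution as the sed variant, with the tab spliced in by the shell.
# Double quotes let ${tab} expand; \( \) and \1 pass through to sed unchanged.
result=$(sed -n "s/^>[^0-9]*\([0-9][0-9]*\).*description=/\1${tab}/p" "$fasta")

printf '%s\n' "$result"
rm -f "$fasta"
```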
I had to guess some requirements; if my guesses are wrong, please edit your question to clarify exactly how the numeric ID should be extracted, for example.