Bash solution to extract part of FASTA header

I have a FASTA file with headers of the form

ABCNA929-08|Lymantria_dispar_dispar|COI-5P|MF131764

and I want to extract everything before the first "|" delimiter, i.e., ABCNA929-08

This is easiest in Bash, but I'm not a frequent user, so I am unsure of the solution.

CodePudding user response：

Using AWK:

awk -F"|" '/\|/ {print $1}' file.fasta

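Explanation: with "|" as the field separator, select the lines that contain a "|" character (the FASTA headers only, not the sequence lines such as ATCGA...) and print the first field, i.e. everything up to the first "|". If a header begins with ">", that character is part of the first field and is printed too.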
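If the headers do start with ">", as in standard FASTA, a variant along these lines might be handy: it treats only lines beginning with ">" as headers and drops that marker from the output. This is a minimal sketch; the function name header_ids and the sample file contents are made up for illustration.

```shell
# header_ids is a hypothetical helper name, not part of any tool.
# It prints the text between the leading ">" and the first "|" of each header.
header_ids() {
  # -F'|' splits on "|"; /^>/ selects header lines only;
  # substr($1, 2) drops the leading ">" from the first field.
  awk -F'|' '/^>/ {print substr($1, 2)}' "$1"
}

# Illustrative sample data (assumed, based on the header in the question)
sample=$(mktemp)
printf '%s\n' '>ABCNA929-08|Lymantria_dispar_dispar|COI-5P|MF131764' 'ATCGATCG' > "$sample"

header_ids "$sample"   # prints ABCNA929-08
rm -f "$sample"
```

Note that this variant prints nothing for headers that lack the leading ">", unlike the command above.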

Or, with bash:

while IFS= read -r line; do [[ $line =~ '|' ]] && echo "${line/|*/}"; done < file.fasta
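The same loop can also be written with suffix-removal parameter expansion, which is available in bash and in POSIX shells generally. The sketch below is one possible form, assuming that any line containing "|" is a header; the function name first_fields is hypothetical.

```shell
# first_fields is a hypothetical helper name.
# For every line that contains "|", print the part before the first "|".
first_fields() {
  # "|| [ -n "$line" ]" also handles a last line without a trailing newline.
  # Note: "line" is a plain global variable here, to stay POSIX-compatible.
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      *'|'*)
        line=${line#'>'}                 # drop one leading ">" if present
        printf '%s\n' "${line%%'|'*}"    # strip the longest suffix starting at the first "|"
        ;;
    esac
  done < "$1"
}

# Illustrative sample data (assumed, based on the header in the question)
sample=$(mktemp)
printf '%s\n' 'ABCNA929-08|Lymantria_dispar_dispar|COI-5P|MF131764' 'ATCGATCG' > "$sample"

first_fields "$sample"   # prints ABCNA929-08
rm -f "$sample"
```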

Does that solve your problem?

CodePudding user response：

Would you please try the following:

sed -E 's/^>?([^|]+).*/\1/' file.fasta

Output:

ABCNA929-08

It will work with or without the leading > character in the FASTA header.

^>? matches zero or one > character at the beginning of the line.
([^|]+) matches a sequence of one or more characters other than |. The matched substring is captured as \1.
.* matches the remaining characters to be removed.
The substitution s/^>?([^|]+).*/\1/ replaces the whole line with the captured substring \1, removing everything else.
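Because sed prints every input line by default, sequence lines (which contain no |) pass through this command unchanged. If only the identifiers are wanted, one option is to require a | in the pattern and print just the lines where the substitution succeeded. A minimal sketch, with a hypothetical function name and made-up sample data:

```shell
# header_ids_sed is a hypothetical helper name.
header_ids_sed() {
  # -n suppresses automatic printing; the trailing "p" prints only lines
  # where the substitution matched. [|] is a literal "|" that must be present,
  # so lines without a delimiter are skipped.
  sed -nE 's/^>?([^|]+)[|].*/\1/p' "$1"
}

# Illustrative sample data (assumed, based on the header in the question)
sample=$(mktemp)
printf '%s\n' '>ABCNA929-08|Lymantria_dispar_dispar|COI-5P|MF131764' 'ATCGATCG' > "$sample"

header_ids_sed "$sample"   # prints ABCNA929-08
rm -f "$sample"
```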