Bash solution to extract part of FASTA header

Time:02-25

I have a FASTA file with headers of the form

ABCNA929-08|Lymantria_dispar_dispar|COI-5P|MF131764

and I want to extract everything before the first "|" delimiter, i.e., ABCNA929-08

This is easiest in Bash, but I'm not a frequent user, so I am unsure of the solution.

CodePudding user response:

Using AWK:

awk -F"|" '/\|/ {print $1}' file.fasta

Explanation: with "|" as the field delimiter, awk selects the lines that contain a "|" character (the FASTA headers, not the sequence lines such as ATCGA...) and prints the first field, i.e. everything up to the first "|".
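If the headers carry the usual leading ">" marker, the command above prints it as part of the first field. A minimal sketch of a variant that selects header lines by that marker and strips it, assuming standard ">"-prefixed headers; the temporary file and the sequence line are made up for illustration:

```shell
# Build a small sample file (hypothetical data; the sequence line is invented).
tmp=$(mktemp)
printf '%s\n' \
  '>ABCNA929-08|Lymantria_dispar_dispar|COI-5P|MF131764' \
  'ACTGACTGACTG' > "$tmp"

# For header lines only (those starting with ">"), remove the ">" from the
# first "|"-separated field and print that field.
ids=$(awk -F'|' '/^>/ { sub(/^>/, "", $1); print $1 }' "$tmp")
echo "$ids"

rm -f "$tmp"
```

Matching on /^>/ instead of /\|/ also picks up headers that happen to contain no "|" at all, in which case the whole header (minus the ">") is printed.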

Or, with bash:

while IFS= read -r line; do [[ $line =~ '|' ]] && echo "${line/|*/}"; done < file.fasta
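The loop relies on bash parameter expansion to cut the line at the first "|". As a rough illustration of the same idea on a single string, here is a sketch using the prefix and suffix removal forms; the variable names are hypothetical and the leading ">" is an assumption about the real file:

```shell
# One header held in a variable (sample taken from the question, with an
# assumed leading ">").
header='>ABCNA929-08|Lymantria_dispar_dispar|COI-5P|MF131764'

no_marker=${header#>}   # drop a single leading ">" if there is one
id=${no_marker%%|*}     # drop the longest suffix that starts with "|"
echo "$id"
```

Because `%%` removes the longest matching suffix, everything from the first "|" onward is discarded, which has the same effect as `${line/|*/}` in the loop.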

Does that solve your problem?

CodePudding user response:

Would you please try the following:

sed -E 's/^>?([^|]+).*/\1/' file.fasta

Output:

ABCNA929-08

It will work with or without the leading > character in the FASTA header.

  • ^>? matches zero or one > character at the beginning of the line.
  • ([^|]+) matches one or more characters other than |. The matched substring is captured as \1.
  • .* matches the remaining characters to be removed.
  • The substitution s/^>?([^|]+).*/\1/ replaces the whole line with the captured substring \1.
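A sed substitution of this kind leaves lines without a "|" (such as sequence lines) unchanged and still prints them. If only the IDs are wanted, one possible sketch is to require a literal "|" in the pattern and print only the lines where the substitution succeeded, using `-n` with the `p` flag. The sample file below is made up for illustration:

```shell
# Build a small sample file (hypothetical data; the sequence line is invented).
tmp=$(mktemp)
printf '%s\n' \
  '>ABCNA929-08|Lymantria_dispar_dispar|COI-5P|MF131764' \
  'ACTGACTGACTG' > "$tmp"

# -n suppresses automatic printing; the trailing "p" prints a line only when
# the substitution matched, i.e. when the line contains a "|".
ids=$(sed -n -E 's/^>?([^|]+)\|.*/\1/p' "$tmp")
echo "$ids"

rm -f "$tmp"
```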