I have a FASTA file with headers of the form
ABCNA929-08|Lymantria_dispar_dispar|COI-5P|MF131764
and I want to extract everything before the first "|" delimiter, i.e., ABCNA929-08.
This is easiest in Bash, but I'm not a frequent user, so I am unsure of the solution.
CodePudding user response:
Using AWK:
awk -F"|" '/\|/ {print $1}' file.fasta
Explanation: with "|" as the field delimiter, match only lines containing a "|" character (the FASTA headers, not the ATCGA... sequence lines) and print the first field (i.e. everything up to the first "|"). If a header starts with ">", that character is kept in the output.
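If the headers in your file do start with ">", here is a minimal sketch of a variant that also strips that character. The sample file and its sequence line are made up for illustration, and the variable names are my own:

```shell
# Hypothetical sample data: one header line and one made-up sequence line.
fasta_awk=$(mktemp)
printf '%s\n' '>ABCNA929-08|Lymantria_dispar_dispar|COI-5P|MF131764' \
              'AACATTATATTTTATTTTTGG' > "$fasta_awk"

# Split on "|", act only on lines starting with ">",
# remove that ">" from the first field, then print the field.
awk_ids=$(awk -F'|' '/^>/ { sub(/^>/, "", $1); print $1 }' "$fasta_awk")
echo "$awk_ids"

rm -f "$fasta_awk"
```

This assumes every header begins with ">"; for files like the one in the question (no ">"), the original command above is the one to use.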
Or, with bash:
while IFS= read -r line; do [[ $line =~ '|' ]] && echo "${line/|*/}"; done < file.fasta
Does that solve your problem?
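For reuse, the same idea can be wrapped in a small function. This is only a sketch: extract_ids is a hypothetical name, the input lines are sample data, and it uses ${line%%|*} (remove the longest suffix starting at a "|") instead of the substitution form. It should run in Bash as well as other POSIX shells.

```shell
# extract_ids: hypothetical helper, reads FASTA lines on stdin
# and prints the part of each header before the first "|".
extract_ids() {
  while IFS= read -r line; do
    case $line in
      *'|'*)                          # only header lines contain "|"
        line=${line#'>'}              # drop a leading ">" if present
        printf '%s\n' "${line%%|*}"   # cut from the first "|" to the end
        ;;
    esac
  done
}

# Sample input: one header plus a made-up sequence line.
loop_ids=$(printf '%s\n' '>ABCNA929-08|Lymantria_dispar_dispar|COI-5P|MF131764' \
                         'AACATTATAT' | extract_ids)
echo "$loop_ids"
```

On a real file you would call it as extract_ids < file.fasta.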
CodePudding user response:
Would you please try the following:
sed -E 's/^>?([^|]+).*/\1/' file.fasta
Output:
ABCNA929-08
It will work with or without the leading > character in the FASTA header.
- ^>? matches zero or one > character at the beginning of the line.
- ([^|]+) matches a sequence of one or more characters other than |. The matched substring is captured as \1.
- .* matches the remaining characters to be removed.
- The substitution s/^>?([^|]+).*/\1/ replaces the whole line with the captured substring \1.
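One thing to be aware of: a sequence line contains no |, so the pattern matches the entire line and the substitution leaves it unchanged, which means sequence lines are printed as they are. If you only want the IDs, a possible variant (a sketch, with made-up sample data and my own variable names) is to require a literal | in the pattern and print only the lines where the substitution succeeded:

```shell
# Hypothetical sample data: one header line and one made-up sequence line.
fasta_sed=$(mktemp)
printf '%s\n' '>ABCNA929-08|Lymantria_dispar_dispar|COI-5P|MF131764' \
              'AACATTATATTTTATTTTTGG' > "$fasta_sed"

# -n suppresses automatic printing; the trailing p prints a line only
# when the substitution matched, i.e. only lines that contain "|".
sed_ids=$(sed -nE 's/^>?([^|]+)\|.*/\1/p' "$fasta_sed")
echo "$sed_ids"

rm -f "$fasta_sed"
```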