Somehow I can't wrap my head around this. I have the following string:
>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA
I would like to use sed to remove the string between the 1st and 2nd occurrence of a space. Hence, in this case, PSBA_LEMMI should be removed. The string between the first two spaces does not contain any special characters.
So far I tried the following:
sed 's/\s.*\s/\s/'
But this removes everything up to the last space, resulting in:
>sp.A9L976 TESTgene=psbA
I thought that by leaving out the g flag, sed would only match the first occurrence of the string. I also tried:
sed 's/(?<=\s).*(?=\s)//'
But this did not match / remove anything. Can someone help me out here? What am I missing?
CodePudding user response:
You can use
sed -E 's/\s+\S+\s+/ /'
sed -E 's/[[:space:]]+[^[:space:]]+[[:space:]]+/ /'
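The two patterns are equivalent: both match one or more whitespace characters, then one or more non-whitespace characters, then one or more whitespace characters. The only difference is that the \s and \S shorthands are supported only by GNU sed, whereas the [[:space:]] bracket expressions are POSIX.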
Note that you cannot use \s as a whitespace char in the replacement part. \s is a regex pattern, and regex is only used in the LHS (left-hand side) to search for whitespace. So, a literal space is required to replace with a space.
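To see why the original attempt removed too much, here is a small sketch that runs a greedy pattern and a restricted one on the header from the question. It assumes the fields are separated by single literal spaces, and the variable names are only for illustration.

```shell
# Sample header from the question.
header='>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA'

# Greedy: .* extends to the LAST space on the line, so almost everything is consumed.
greedy=$(printf '%s\n' "$header" | sed 's/ .* / /')

# Restricted: [^ ]* cannot cross a space, so only the second field is removed.
fixed=$(printf '%s\n' "$header" | sed 's/ [^ ]* / /')

printf '%s\n' "$greedy" "$fixed"
```

The g flag only controls how many matches per line are replaced; it has no effect on how far a single .* reaches. The lookaround attempt fails for a different reason: sed's BRE and ERE syntaxes have no (?<=...) or (?=...) constructs, so that pattern does not mean what it means in PCRE.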
CodePudding user response:
You can try this sed command:
sed 's/^\([^ ]*\) [^ ]* \(.*\)/\1 \2/' input_file
This uses grouping to keep everything except the text between the first and second occurrence of a space.
Output
>sp.A9L976 Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA
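To make the grouping visible, the following sketch prints what each capture group holds instead of joining them. It uses a simplified pattern built on [^ ]* (any run of non-space characters), which is an assumption chosen for illustration, and it again assumes single literal spaces between fields.

```shell
# Sample header from the question.
header='>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA'

# Group 1 is the first field, group 2 is everything after the second space.
# The middle field sits outside both groups, so it is dropped.
groups=$(printf '%s\n' "$header" | sed 's/^\([^ ]*\) [^ ]* \(.*\)/1=[\1] 2=[\2]/')

printf '%s\n' "$groups"
```

Replacing the marker text with \1 \2 gives the header without its second field.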
CodePudding user response:
To edit the header of the fasta file as you specify, use this Perl one-liner:
echo '>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA' | perl -lpe 's{^(>\S+\s+)\S+\s+(.*)}{$1$2}'
Prints:
>sp.A9L976 Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA
To edit the file in place:
perl -i.bak -lpe 's{^(>\S+\s+)\S+\s+(.*)}{$1$2}' in_file.fasta
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
Here,
^ : beginning of the line.
> : literal "greater than" character, which marks the beginning of the header in fasta format specifications.
\S+ : 1 or more non-whitespace characters.
\s+ : 1 or more whitespace characters.
.* : any character, 0 or more occurrences.
$1, $2 : 1st and 2nd captured pattern. Capture occurs using parentheses: (...).
SEE ALSO:
perldoc perlrun : how to execute the Perl interpreter: command line switches
perldoc perlre : Perl regular expressions (regexes)