I am trying to shorten the long names in rows starting with >, so that I keep only the part up to (but not including) Stage_V_sporulation_protein...:
>tr_A0A024P1W8_A0A024P1W8_9BACI_Stage_V_sporulation_protein_AE_OS=Halobacillus_karajensis_OX=195088_GN=BN983_00096_PE=4_SV=1
MTFLWAFLVGGGICVIGQILLDVFKLTPAHVMSSFVVAGAVLDAFDLYDNLIRFAGGGATVPITSFGHSLLHGAMEQADEHGVIGVAIGIFELTSAGIASAILFGFIVAVIFKPKG
>tr_A0A060LWV2_A0A060LWV2_9BACI_SpoIVAD_sporulation_protein_AEB_OS=Alkalihalobacillus_lehensis_G1_OX=1246626_GN=BleG1_2089_PE=4_SV=1
MIFLWAFLVGGVICVIGQLLMDVVKLTPAHTMSTLVVSGAVLAGFGLYEPLVDFAGAGATVPITSFGNSLVQGAMEEANQVGLIGIITGIFEITSAGISAAIIFGFIAALIFKPKG
I am doing a loop:
cat file.txt | while read line; do
if [[ $line = \>* ]] ; then
cut -d_ -f1-4 $line;
fi;
done
but it addresses files, not rows in the file (I get cut: >>tr_A0A024P1W8_A0A024P1W8_9BACI_Stage_V_sporulation_protein_AE_OS=Halobacillus_karajensis_OX=195088_GN=BN983_00096_PE=4_SV=1: No such file or directory).
My desired output is:
>tr_A0A024P1W8_A0A024P1W8_9BACI
MTFLWAFLVGGGICVIGQILLDVFKLTPAHVMSSFVVAGAVLDAFDLYDNLIRFAGGGATVPITSFGHSLLHGAMEQADEHGVIGVAIGIFELTSAGIASAILFGFIVAVIFKPKG
>tr_A0A060LWV2_A0A060LWV2_9BACI
MIFLWAFLVGGVICVIGQLLMDVVKLTPAHTMSTLVVSGAVLAGFGLYEPLVDFAGAGATVPITSFGNSLVQGAMEEANQVGLIGIITGIFEITSAGISAAIIFGFIAALIFKPKG
How do I change actual rows?
CodePudding user response:
With the current state of the question, it seems easiest to do:
awk '/^>/ {print $1,$2,$3,$4; next}1' FS=_ OFS=_ file.txt
Lines that match the > at the beginning of the line get only the first four fields printed, separated by _ (the value of OFS). Lines that do not match are printed unchanged.
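As for why the original loop fails: cut treats its non-option arguments as file names, so the header text is looked up as a file. A minimal sketch of a repaired loop follows, assuming the goal is still to keep the first four underscore-separated fields. The function name shorten_headers is made up for illustration, and a case statement stands in for the [[ ... ]] test so the sketch also runs in a plain POSIX shell.

```shell
# Sketch: feed each header line to cut on standard input
# instead of passing the line to cut as a file name.
# shorten_headers is a hypothetical helper name; $1 is the input FASTA file.
shorten_headers() {
  while IFS= read -r line; do
    case $line in
      '>'*)
        # Header line: keep only the first four _-separated fields
        printf '%s\n' "$line" | cut -d_ -f1-4
        ;;
      *)
        # Sequence line: print unchanged
        printf '%s\n' "$line"
        ;;
    esac
  done < "$1"
}

# Usage (file names as in the question):
# shorten_headers file.txt > out.txt
```

This starts one cut process per header, so for large files the awk command above is likely to be noticeably faster.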
CodePudding user response:
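One way using sed, which works for headers that contain the literal text _Stage_V_sporulation_protein (the second header in the sample does not, so it is left unchanged):
sed -E '/^>/s/(.*)_Stage_V_sporulation_protein.*/\1/' file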
CodePudding user response:
A sed one-liner would be:
sed '/^>/s/^\(\([^_]*_\)\{3\}[^_]*\).*/\1/' file
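To see what the pattern does, it can help to run it on a single header. The sketch below applies the same sed expression to the first header from the question; the variable names header and short are only for illustration.

```shell
# First header from the question
header='>tr_A0A024P1W8_A0A024P1W8_9BACI_Stage_V_sporulation_protein_AE_OS=Halobacillus_karajensis_OX=195088_GN=BN983_00096_PE=4_SV=1'

# How the pattern is built:
#   \([^_]*_\)\{3\}   three "field plus underscore" groups
#   [^_]*             the fourth field, without a trailing underscore
#   .*                the rest of the line, which is dropped
short=$(printf '%s\n' "$header" | sed '/^>/s/^\(\([^_]*_\)\{3\}[^_]*\).*/\1/')

printf '%s\n' "$short"
# prints: >tr_A0A024P1W8_A0A024P1W8_9BACI
```

Because the pattern counts fields rather than looking for a particular protein name, it shortens both sample headers in the same way.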
CodePudding user response:
Use this Perl one-liner to process the headers in your FASTA file:
perl -lpe 'if ( m{^>} ) { @f = split m{_}, $_; splice @f, 4; $_ = join "_", @f; }' file.txt > out.txt
The Perl one-liner uses these command line flags:
-e: Tells Perl to look for code in-line, instead of in a file.
-p: Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l: Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
The one-liner uses split to split the input string on underscore into the array @f. Then splice is used to remove from the array all elements except for the first 4. Finally, join joins these elements on an underscore.
All of the above is wrapped inside if ( m{^>} ) { ... } in order to limit the costly string manipulations to the FASTA headers (the lines that start with >).
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches