Remove a substring from lines starting with a specific character

I am trying to shorten the long names in rows starting with >, so that I keep only the part up to Stage_V_sporulation_protein...:

>tr_A0A024P1W8_A0A024P1W8_9BACI_Stage_V_sporulation_protein_AE_OS=Halobacillus_karajensis_OX=195088_GN=BN983_00096_PE=4_SV=1
MTFLWAFLVGGGICVIGQILLDVFKLTPAHVMSSFVVAGAVLDAFDLYDNLIRFAGGGATVPITSFGHSLLHGAMEQADEHGVIGVAIGIFELTSAGIASAILFGFIVAVIFKPKG
>tr_A0A060LWV2_A0A060LWV2_9BACI_SpoIVAD_sporulation_protein_AEB_OS=Alkalihalobacillus_lehensis_G1_OX=1246626_GN=BleG1_2089_PE=4_SV=1
MIFLWAFLVGGVICVIGQLLMDVVKLTPAHTMSTLVVSGAVLAGFGLYEPLVDFAGAGATVPITSFGNSLVQGAMEEANQVGLIGIITGIFEITSAGISAAIIFGFIAALIFKPKG

I am doing a loop:

cat file.txt | while read line; do 
  if [[ $line = \>* ]] ; then
    cut -d_ -f1-4 $line; 
  fi; 
done

but it treats each row as a file name instead of editing the rows in the file (I get cut: >>tr_A0A024P1W8_A0A024P1W8_9BACI_Stage_V_sporulation_protein_AE_OS=Halobacillus_karajensis_OX=195088_GN=BN983_00096_PE=4_SV=1: No such file or directory).

My desired output is:

>tr_A0A024P1W8_A0A024P1W8_9BACI
MTFLWAFLVGGGICVIGQILLDVFKLTPAHVMSSFVVAGAVLDAFDLYDNLIRFAGGGATVPITSFGHSLLHGAMEQADEHGVIGVAIGIFELTSAGIASAILFGFIVAVIFKPKG
>tr_A0A060LWV2_A0A060LWV2_9BACI
MIFLWAFLVGGVICVIGQLLMDVVKLTPAHTMSTLVVSGAVLAGFGLYEPLVDFAGAGATVPITSFGNSLVQGAMEEANQVGLIGIITGIFEITSAGISAAIIFGFIAALIFKPKG

How do I change actual rows?

CodePudding user response：

With the current state of the question, it seems easiest to do:

awk '/^>/ {print $1,$2,$3,$4; next}1' FS=_ OFS=_ file.txt

Lines that start with > get only their first four fields printed, joined by _ (the value of OFS); next then skips straight to the next input line. All other lines fall through to the trailing 1, an always-true pattern whose default action prints the line unchanged.
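
For comparison, the loop from the question can be made to work as well. The error there comes from cut treating its argument as a file name; piping the line to cut on standard input avoids that. The following is only a sketch: the function name shorten_headers is made up for illustration, and starting one cut process per header will be noticeably slower than the awk command on large files.

```shell
# Hypothetical helper: shorten FASTA headers to their first four "_"-separated fields.
shorten_headers() {
  while IFS= read -r line; do
    case $line in
      '>'*) printf '%s\n' "$line" | cut -d_ -f1-4 ;;  # header: cut reads the line on stdin
      *)    printf '%s\n' "$line" ;;                   # sequence line: passed through unchanged
    esac
  done
}

# Small demonstration on made-up input (two lines: one header, one sequence).
printf '%s\n' '>tr_A0A024P1W8_A0A024P1W8_9BACI_Stage_V_sporulation_protein_AE' 'MTFLWAF' |
  shorten_headers
```

On a real file it would be called as shorten_headers < file.txt > out.txt.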

CodePudding user response：

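One way using sed, assuming the part to drop always starts with the literal text _Stage_V_sporulation_protein:

sed -E '/^>/s/(.*)_Stage_V_sporulation_protein.*/\1/' file

The trailing .* is needed so that everything after the matched text is dropped as well; without it, only _Stage_V_sporulation_protein itself is deleted. Headers that do not contain that text, such as the SpoIVAD_sporulation_protein one in the sample, are left unchanged.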

CodePudding user response：

A sed one-liner would be:

sed '/^>/s/^\(\([^_]*_\)\{3\}[^_]*\).*/\1/' file
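
To make the pattern easier to read, here is a sketch of the same substitution written with -E (extended regular expressions, so the groups need no backslashes) and annotated piece by piece. The wrapper name keep_four_fields is hypothetical and only there so the command can be reused; the sample input is made up.

```shell
# Hypothetical wrapper around the same substitution, in extended-regex syntax.
#   /^>/          only act on lines that start with ">"
#   ([^_]*_){3}   three fields, each followed by an underscore
#   [^_]*         the fourth field (no underscore after it)
#   .*            everything else on the line, which is dropped
#   \1            the outer group: the first four fields
keep_four_fields() {
  sed -E '/^>/s/^(([^_]*_){3}[^_]*).*/\1/' "$@"
}

# Small demonstration on made-up input; with no arguments sed reads stdin.
printf '%s\n' '>tr_A0A060LWV2_A0A060LWV2_9BACI_SpoIVAD_sporulation_protein_AEB' 'MIFLWAF' |
  keep_four_fields
```

Because it counts fields instead of matching a protein name, this form handles both sample headers. A header with fewer than four underscore-separated fields does not match and is left as it is.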

CodePudding user response：

Use this Perl one-liner to process the headers in your FASTA file:

perl -lpe 'if ( m{^>} ) { @f = split m{_}, $_; splice @f, 4; $_ = join "_", @f; }' file.txt > out.txt

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.

The one-liner uses split to split the input string on underscore into the array @f.
Then splice is used to remove from the array all elements except for the first 4 elements.
Finally, join joins these elements on an underscore.
All of the above is wrapped inside if ( m{^>} ) { ... } in order to limit the costly string manipulations only to the FASTA headers (the lines that start with >).
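
The same split / splice / join steps can also be sketched in plain POSIX shell, which may help if Perl is not available. This is an illustration, not a replacement for the one-liner: the function name first_four_fields is made up, it handles a single header string passed as its argument, and it assumes IFS is set (as it normally is) when the function is called.

```shell
# Hypothetical helper mirroring split / splice / join for one header line.
first_four_fields() {
  header=$1
  old_ifs=$IFS
  IFS=_
  set -f                         # no filename globbing while splitting
  set -- $header                 # "split": the fields become $1, $2, ...
  set +f
  if [ "$#" -gt 4 ]; then
    set -- "$1" "$2" "$3" "$4"   # "splice": keep only the first four fields
    header="$*"                  # "join": "$*" joins on the first character of IFS
  fi
  IFS=$old_ifs                   # restore the caller's field separator
  printf '%s\n' "$header"
}

# Small demonstration on a made-up header.
first_four_fields '>tr_A0A024P1W8_A0A024P1W8_9BACI_Stage_V_sporulation_protein_AE'
```

Headers with four or fewer fields are printed as they came in, since there is nothing to remove.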