Questions about the sed command to remove the characters and create a new file

Time:12-02

I have a file called "Jimbleprot.pep" and it looks like this:

  119 >TRINITY_DN10025_c0_g1::TRINITY_DN10025_c0_g1_i1::g.7937::m.7937 TRINITY_DN10025_c0_g1::TRINITY_DN10025_c0_g1_i1::g.7937  ORF type:5prime_partial len:201 (-),score=29.41 TRINITY_DN10025_c0_g1_i1:192-794(-)
  120 DEDDEPQELQVHVDVSLPGTPLGSENSVHSGCSHKSNVSSSLVAA
  123 TPTCVRLEEELHGPIRDGLS*
  124 >TRINITY_DN10026_c0_g1::TRINITY_DN10026_c0_g1_i1::g.7944::m.7944 TRINITY_DN10026_c0_g1::TRINITY_DN10026_c0_g1_i1::g.7944  ORF type:internal len:115 ( ),score=15.25,NACHT|PF05729.12|0.068,Tti2|PF10521.9|0.1      4 TRINITY_DN10026_c0_g1_i1:2-343( )
  125 SVEDLAIKILTTCGIPAPSVGEKEKLQELLKIVTRKCLLILDNLDHAFHADDKRRDSMDK
  126 SIRTTYKPSFSGNSTASSNISESSAAAFASSSANLPGIPTSISFAPYRSGIVHS
  127 >TRINITY_DN10028_c0_g1::TRINITY_DN10028_c0_g1_i1::g.7938::m.7938 TRINITY_DN10028_c0_g1::TRINITY_DN10028_c0_g1_i1::g.7938  ORF type:5prime_partial len:223 (-),score=32.40,gi|114149223|sp|Q9NUQ8.2|ABCF3_HUMA      N|66.97|6e-103,gi|114149223|sp|Q9NUQ8.2|ABCF3_HUMAN|23.39|7e-11,ABC_tran|PF00005.27|3.4e-23,AAA_21|PF13304.6|0.00057,AAA_21|PF13304.6|3.2e-05,SMC_N|PF02463.19|0.082,SMC_N|PF02463.19|6.7e-05,AAA_22|PF13401.      6|0.00019,AAA_30|PF13604.6|0.00038,AAA_16|PF13191.6|0.0019,AAA_29|PF13555.6|0.0081,NACHT|PF05729.12|0.0075,AAA_15|PF13175.6|0.015,AAA_15|PF13175.6|1.1e 03,Pox_A32|PF04665.12|0.01,Dynamin_N|PF00350.23|0.016      ,MobB|PF03205.14|0.026,DUF87|PF01935.17|0.039,DUF87|PF01935.17|3.3e 03,AAA_23|PF13476.6|0.033,AAA_24|PF13479.6|0.2,AAA_24|PF13479.6|1.1e 03,FtsK_SpoIIIE|PF01580.18|0.069,AAA_18|PF13238.6|0.079,AAA_18|PF132      38.6|3e 03,DLIC|PF05783.11|0.046,T2SSE|PF00437.20|0.063,SbcCD_C|PF13558.6|0.33,MMR_HSR1|PF01926.23|0.059,TniB|PF05621.11|36,TniB|PF05621.11|3.7,Arf|PF00025.21|0.27,Arf|PF00025.21|2.2e 03,AAA_19|PF13245.6|0      .18,TsaE|PF02367.17|0.23,Roc|PF08477.13|0.22 TRINITY_DN10028_c0_g1_i1:59-727(-)
  128 IAVVGDNGSGKTTLLKILLGELEPVKLATKFPGKNVEHYRHQLG
RYGVSGDLATRFQGGVILVSHDERLVRSMCDEVWVCGNRQVKSIEGGFDQYKRMVQEELQAVLQ*
  132 >TRINITY_DN1002_c0_g1::TRINITY_DN1002_c0_g1_i1::g.2343::m.2343 TRINITY_DN1002_c0_g1::TRINITY_DN1002_c0_g1_i1::g.2343  ORF type:5prime_partial len:174 ( ),score=44.19,EF-hand_7|PF13499.6|0.15,EF-hand_6|PF13      405.6|1.2,EF-hand_6|PF13405.6|8.8e 02,EF-hand_6|PF13405.6|6e 03,EF-hand_6|PF13405.6|1.3e 03,EF-hand_1|PF00036.32|0.16,EF-hand_1|PF00036.32|5.8e 03,EF-hand_1|PF00036.32|6.3e 03,EF-hand_1|PF00036.32|8.6e 03       TRINITY_DN1002_c0_g1_i1:2-523( )
  133 EQHPKIRMQIAQKVFNVLDPDKKGYANKDDIMALTVDKLKAIADIVDPDYANTEEYEHVL
  134 FGEAEVRDAFQDALEEGNGELHLQKLIQKYKDLGGSEKVARELFAMLKPKSKDKATADEV
  135 EENLSNVLELYKKIRDEDKSGIFYDRHLQEDKEIMAKTLHEKTDTDGTKHDEL*
  136 >TRINITY_DN1002_c0_g1::TRINITY_DN1002_c0_g1_i2::g.2344::m.2344 TRINITY_DN1002_c0_g1::TRINITY_DN1002_c0_g1_i2::g.2344  ORF type:5prime_partial len:174 ( ),score=54.09,EF-hand_7|PF13499.6|0.13,EF-hand_8|PF13      833.6|1.6,EF-hand_8|PF13405.6|1.2,EF-hand_6|PF13405.6|1.5e 03,EF-hand_6|PF13405.6|6e 03,EF-hand_6|PF13405.6|8.8e 02,EF-hand_1|PF00036.32|0.16,EF-hand _1|PF00036.32|PF00036.32|6.7e 03 TRINITY_DN1002_c0_g1_i2:2-523( )
  137 EQHPKIRMQIAQKVFNVLDPDKKGYANKDDIMALTVDKLKAIADIVDPDYANTEEYEHVL
  138 FGEEEVRDAFQDALEEGNGELHLQKLILKYKDLGGSEKVARELFAMLKPKSKDKATADEV
  139 EENLSNVLELYKKIRDDDKSGVFYDRHLQEDKEIMAKTLDEGTNTDGAKHEEL*

The file is more than 8000 lines long. I want to use the sed command to remove the * characters in the first 5000 lines and write the result to a new file called "newprot.fasta". I think I may need to pipe (|) to head, but I'm not sure how to do that or how to combine it with the sed command. Thank you!

CodePudding user response:

ed is, as is usually the case when you're working with a file instead of a stream that's part of a pipeline, a better choice than sed here:

printf '%s\n' '1,5000 g/\*/s/\*//g' '1,5000 w newprot.fasta' Q | ed -s Jimbleprot.pep

will remove every * character in the first 5000 lines of the file, and then save those first 5000 lines in the new file.

It can also be written with ed's input as a heredoc, if you prefer that for readability:

ed -s Jimbleprot.pep <<EOF
1,5000 g/\*/s/\*//g
1,5000 w newprot.fasta
Q
EOF

CodePudding user response:

My approach is to split it into three steps (there may be a better way):

  1. Take only the first 5000 lines into a new file and delete every *.

    sed '5001,$ d' Jimbleprot.pep | tr -d '*' > temp_file.txt
    
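  2. Keep only the remaining lines (line 5001 onward) in the Jimbleprot.pep file. Note that this overwrites the original file, and that tail needs the + sign to start at line 5001 rather than print the last 5001 lines:

    tail -n +5001 Jimbleprot.pep > test.tmp && mv test.tmp Jimbleprot.pep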
    
  3. Concatenate both files into newprot.fasta:

    cat temp_file.txt Jimbleprot.pep > newprot.fasta
    

It should work. Again, I don't know if it's the best way, but it's my way.

Edit: if you want only the first 5000 lines without the *, just do step 1.
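
Since the question asked about piping to head, here is a minimal sketch of the same idea that leaves the original file untouched. It is only an illustration: demo.pep, first_only.fasta, whole.fasta and N are made-up names, with a four-line sample and N=2 standing in for Jimbleprot.pep and 5000.

```shell
# Hypothetical demo data: demo.pep stands in for Jimbleprot.pep, N for 5000.
N=2
printf '%s\n' '>seq1' 'AB*' '>seq2' 'CD*' > demo.pep

# Only the first N lines, with every * removed.
head -n "$N" demo.pep | sed 's/\*//g' > first_only.fasta

# Whole file: first N lines cleaned, remaining lines copied unchanged.
# The braces group both commands so a single redirection collects their output.
{
  head -n "$N" demo.pep | tr -d '*'
  tail -n +"$((N + 1))" demo.pep
} > whole.fasta
```

For the real file, the first pipeline would presumably become head -n 5000 Jimbleprot.pep | sed 's/\*//g' > newprot.fasta, with no temporary files and no change to Jimbleprot.pep.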

CodePudding user response:

This might work for you (GNU sed):

sed '1,5000s/\*//g;w newprot.fasta' Jimbleprot.pep

This removes every * from the first 5000 lines and then writes every line to newprot.fasta. Because -n is not given, sed also prints each line to standard output.

If only the first 5000 lines are required in newprot.fasta, use:

sed -e '1,5000s/\*//g;w newprot.fasta' -e '5000q' Jimbleprot.pep
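
To try both variants on something small first, a sketch like the following may help. The file names and the four-line sample are invented for the demo (line 2 plays the role of line 5000), and -n is my addition to keep sed from echoing every line to the terminal; it is not part of the commands above.

```shell
# Hypothetical demo data: demo.pep stands in for Jimbleprot.pep.
printf '%s\n' '>seq1' 'AB*' '>seq2' 'CD*' > demo.pep

# Strip * from lines 1-2, then write every line to the file.
# -n suppresses sed's default printing, so output goes only to the file.
sed -n -e '1,2s/\*//g;w all_lines.fasta' demo.pep

# Same, but quit after line 2 so only the first two lines are written.
sed -n -e '1,2s/\*//g;w first_two.fasta' -e '2q' demo.pep

# Count the lines that still contain a * in each result.
# grep -c exits non-zero when the count is 0, hence the "|| true".
grep -c '\*' all_lines.fasta || true
grep -c '\*' first_two.fasta || true
```

The w filename runs to the end of its script fragment, which is why the q command sits in a separate -e.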