Replace single character in fasta header with awk or sed

Time:12-18

I am working in bash with a fasta file whose headers begin with a ">" and end with either a "C" or a " ", like so:

>chr1:35031657-35037706 
GGTGGACTAGCCAGTGAATGTCAACGCGTCCCTA
CCTAAGGCGATATCCGCAGCCGCCCGCGTCCCTA
>chr1:71979382-71985425C
agattaaatgaactattacacataaagtgcttac
ttacacataaagtgcttacgaactattacaggga

I'd like to use awk (gsub?) or sed to change the last character of the header to a " " if it is a "C". Basically I want all of the headers to end in " ". No C's.

Desired output:

>chr1:35031657-35037706 
GGTGGACTAGCCAGTGAATGTCAACGCGTCCCTA
CCTAAGGCGATATCCGCAGCCGCCCGCGTCCCTA
>chr1:71979382-71985425 
agattaaatgaactattacacataaagtgcttac
ttacacataaagtgcttacgaactattacaggga

Nothing needs to change with the sequences. I think this is pretty straightforward, but I'm struggling to use other posts to do this myself. I know that awk '/^>/ && /C$/{print $0}' will print the headers that begin with ">" and end with "C", but I'm not sure how to replace all of those "C"s with " "s.

Thanks for your help!

CodePudding user response:

I think this would be easier to do in sed:

sed '/^>/ s/C$/ /'

Translation: on lines starting with ">", replace a "C" at the end of the line with " ". If no "C" is matched, there is no error; the line is simply left unchanged. Also, unlike awk, sed automatically prints each line after processing it.
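
To see the address restriction at work, here is a minimal sketch run against a small test file. The file names sample.fa and fixed.fa are made up for illustration, and the last sequence line has been contrived to end in an uppercase "C" so that it would be changed if the /^>/ address were missing.

```shell
# Build a tiny test file (sample.fa is a hypothetical name, not from the question).
printf '%s\n' \
  '>chr1:35031657-35037706 ' \
  'GGTGGACTAGCCAGTGAATGTCAACGCGTCCCTA' \
  '>chr1:71979382-71985425C' \
  'agattaaatgaactattacacataaagtgcttaC' > sample.fa

# The /^>/ address limits the substitution to header lines,
# so the sequence line that ends in "C" passes through untouched.
sed '/^>/ s/C$/ /' sample.fa > fixed.fa

cat fixed.fa
```

Writing to a new file and checking it before replacing the original is the cautious route; in-place editing with sed exists but its option syntax differs between implementations.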

If you really want to use awk, the equivalent would be:

awk '/^>/ {sub("C$"," ",$0)}; {print}'
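
One way to check that the two commands really do the same thing is to run both on the same input and compare the results byte for byte. This is only a sketch: headers.fa, sed_out.fa and awk_out.fa are hypothetical file names, and the shortened sequence lines are invented test data.

```shell
# Small test input (hypothetical file name and shortened, made-up sequences).
printf '%s\n' \
  '>chr1:35031657-35037706 ' \
  'GGTGGACTAGCC' \
  '>chr1:71979382-71985425C' \
  'agattaaatgaaC' > headers.fa

# Run the sed command and the awk command from this answer on the same file.
sed '/^>/ s/C$/ /' headers.fa > sed_out.fa
awk '/^>/ {sub("C$"," ",$0)}; {print}' headers.fa > awk_out.fa

# cmp -s is silent and exits 0 only when the two files are identical.
if cmp -s sed_out.fa awk_out.fa; then
  echo "sed and awk outputs are identical"
else
  echo "outputs differ"
fi
```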