I am working in bash with a fasta file with headers that begin with a ">" and end with either a "C" or a " ". Like so:
>chr1:35031657-35037706
GGTGGACTAGCCAGTGAATGTCAACGCGTCCCTA
CCTAAGGCGATATCCGCAGCCGCCCGCGTCCCTA
>chr1:71979382-71985425C
agattaaatgaactattacacataaagtgcttac
ttacacataaagtgcttacgaactattacaggga
I'd like to use awk (gsub?) or sed to change the last character of the header to a " " if it is a "C". Basically, I want all of the headers to end in " ", with no trailing "C".
Desired output:
>chr1:35031657-35037706
GGTGGACTAGCCAGTGAATGTCAACGCGTCCCTA
CCTAAGGCGATATCCGCAGCCGCCCGCGTCCCTA
>chr1:71979382-71985425
agattaaatgaactattacacataaagtgcttac
ttacacataaagtgcttacgaactattacaggga
Nothing needs to change with the sequences. I think this is pretty straightforward, but I'm struggling to use other posts to do this myself. I know that
awk '/^>/ && /C$/{print $0}'
will print the headers that begin with ">" and end with "C", but I'm not sure how to replace all of those "C"s with " "s.
Thanks for your help!
CodePudding user response:
I think this would be easier to do in sed:
sed '/^>/ s/C$/ /'
Translation: on lines starting with ">", replace a "C" at the end of the line with " ". Note that if the "C" isn't matched, there is no error; the line is simply left unchanged. Also, unlike awk, sed automatically prints each line after processing it.
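To see the effect on a small input, here is a minimal sketch. The temporary directory, the file names sample.fa and fixed.fa, and the short sequence line ACGTACGTC are made up for illustration; that sequence line ends in "C" on purpose, to show that only header lines are touched.

```shell
# Hypothetical scratch directory and file names, for illustration only
workdir=$(mktemp -d)

# Sample input: one header ending in a space, one ending in "C",
# and a made-up sequence line that also ends in "C"
printf '%s\n' \
  '>chr1:35031657-35037706 ' \
  'ACGTACGTC' \
  '>chr1:71979382-71985425C' \
  'agattaaatgaactattacacataaagtgcttac' > "$workdir/sample.fa"

# On header lines only, replace a trailing "C" with a space
sed '/^>/ s/C$/ /' "$workdir/sample.fa" > "$workdir/fixed.fa"

# The "l" command prints each line with a "$" end marker,
# which makes the trailing spaces visible
sed -n 'l' "$workdir/fixed.fa"
```

Both headers should now end in a space, while the sequence line ending in "C" is unchanged because the /^>/ address does not match it.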
If you really want to use awk, the equivalent would be:
awk '/^>/ {sub("C$"," ",$0)}; {print}'
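A slightly more compact way to write the awk version, offered as a sketch rather than the only option: sub() acts on $0 when no third argument is given, and a bare 1 is an always-true pattern whose default action is to print the line. The scratch directory, file names, and the ACGTACGTC sequence line below are made up for illustration; cmp is used only to check that awk and sed produce the same bytes.

```shell
# Hypothetical scratch directory and file names, for illustration only
workdir=$(mktemp -d)

# Minimal sample: a header ending in "C" and a made-up sequence line ending in "C"
printf '%s\n' '>chr1:71979382-71985425C' 'ACGTACGTC' > "$workdir/sample.fa"

# awk: on header lines, replace a trailing "C" with a space; "1" prints every line
awk '/^>/ { sub(/C$/, " ") } 1' "$workdir/sample.fa" > "$workdir/awk.fa"

# sed: the same edit, for comparison
sed '/^>/ s/C$/ /' "$workdir/sample.fa" > "$workdir/sed.fa"

# cmp exits 0 silently when the two files are byte-identical
cmp "$workdir/awk.fa" "$workdir/sed.fa" && echo "identical"
```

Using a regex literal (/C$/) instead of the string "C$" avoids awk's extra round of string escape processing, which matters once a pattern contains backslashes.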