Home > Mobile >  Reverse complement SOME sequences in fasta file
Reverse complement SOME sequences in fasta file

Time:12-17

I've been reading lots of helpful posts about reverse complementing sequences, but I've got what seems to be an unusual request. I'm working in bash and I have DNA sequences in fasta format in my stdout that I'd like to pass on down the pipe. The seemingly unusual bit is that I'm trying to reverse complement SOME of those sequences, so that the output has all the sequences in the same direction (for multiple sequence alignment later).

My fasta headers end in either "C" or " ". I'd like to reverse complement the ones that end in "C". Here's a little subset:

>chr1:86214203-86220231 
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
caccttagagataatgaagtatattcagaatgtagaacattctataagac
aactgacccaatatcttttaaaaagtcaatgccatgttaaaaataaaaag

I know there are lots of ways to reverse complement out there, like:

echo ACCTTGAAA | tr ACGTacgt TGCAtgca | rev

and

seqtk seq -r in.fa > out.fa

But I'm not sure how to do this for only those sequences that have a C at the end of the header. I think awk or sed is probably the ticket, but I'm at a loss as to how to actually code it. I can get the sequence headers with awk, like:

awk '/^>/ { print $0 }'
>chr1:84518073-84524089C
>chr1:86214203-86220231 

But if someone could help me figure out how to turn that awk statement into one that asks "if the last character in the header has a C, do this!" that would be great!

CodePudding user response:

You didn't provide the expected output so it's guess but this may be what you're trying to do, using any awk:

$ cat tst.awk
/^>/ {
    type = substr($0,length($0),1)
}
!/^>/ && (type == "C") {
    $0 = rev( tr( $0, "ACGTacgt TGCAtgca" ) )
}
{ print }

function tr(oldStr,trStr,   i,lgth,char,newStr) {
    if ( !_trSeen[trStr]   ) {
        lgth = (length(trStr) - 1) / 2
        for ( i=1; i<=lgth; i   ) {
            _trMap[trStr,substr(trStr,i,1)] = substr(trStr,lgth 1 i,1)
        }
    }
    lgth = length(oldStr)
    for (i=1; i<=lgth; i  ) {
        char = substr(oldStr,i,1)
        newStr = newStr ( (trStr,char) in _trMap ? _trMap[trStr,char] : char )
    }
    return newStr
}

function rev(oldStr,    i,lgth,char,newStr) {
    lgth = length(oldStr)
    for ( i=1; i<=lgth; i   ) {
        char = substr(oldStr,i,1)
        newStr = char newStr
    }
    return newStr
}

$ awk -f tst.awk file
>chr1:86214203-86220231 
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
gtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
ctttttatttttaacatggcattgactttttaaaagatattgggtcagtt

CodePudding user response:

An earlier answer (by Ed Morton) uses a self-contained awk procedure to selectively reverse-complement sequences following a comment line ending with "C". Although I think that to be the best approach, I will offer an alternative approach that might have wider applicability.

The procedure here uses awk's system() function to send data extracted from the fasta file in awk to the shell where the sequence can be processed by any of the many shell applications existing for sequence manipulation.

I have defined an awk user function to pass the isolated sequence from awk to the shell. It can be called from any part of the awk procedure:

function processSeq(s)
 {system("echo \"" s "\" |  tr ACGTacgt TGCAtgca | rev ");}

The argument of the system function is a string containing the command you would type into terminal to achieve the desired outcome (in this case I've used one of the example reverse-complement routines mentioned in the question). The parts to note are the correct escaping of quote marks that are to appear in the shell command, and the variable s that will be substituted for the sequence string assigned to it when the function is called. The value of s is concatenated with the strings quoted before and after it in the argument to system() shown above.

isolating the required sequences

The rest of the procedure addresses how to achieve:

"if the last character in the header has a C, do this"

Before making use of shell applications, awk needs to isolate the part(s) of the file to process. In general terms, awk employs one or more pattern/action blocks where only records (lines by default) that match a given pattern are processed by the subsequent action commands. For example, the following illustrative procedure performs the action of printing the whole line print $0 if the pattern /^>/ && /C$/ is true for that line (where /^>/ looks for ">" at the start of a line and /C$/ looks for "C" at the end of the same line.:

/^>/ && /C$/{ print $0 }

For the current needs, the sequence begins on the next record (line) after any record beginning with > and ending with C. One way of referencing that next line is to set a variable (named line in my example) when the C line is encountered and establishing a later pattern for the record with numerical value one more than line variable.

Because fasta sequences may extend over several lines, we have to accumulate several successive lines following a C title line. I have achieved this by concatenating each line following the C title line until a record beginning with > is encountered again (or until the end of the file is reached, using the END block).

In order that sequence lines following a non-C title line are ignored, I have used a variable named flag with values of either "do" or "ignore" set when a title record is encountered.

The call to a the custom function processSeq() that employs the system() command, is made at the beginning of a C title action block if the variable seq holds an accumulated sequence (and in the END block for relevant sequences that occur at the end of the file where there will be no title line).

Test file and procedure

A modified version of your example fasta was used to test the procedure. It contains an extra relevant C record with three and-a-bit lines instead of two, and an extra irrelevant record.

seq.fasta:

>chr1:86214203-86220231 
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
caccttagagataatgaagtatattcagaatgtagaacattctataagac
aactgacccaatatcttttaaaaagtcaatgccatgttaaaaataaaaag
>chr1:86214203-86220231 
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chranotherC
aatgaagtatattcagaatgtagaacattaactgacccgccatgttaatc
aatatctataagaccttttaaaaagcaccttagagattcaataaagtcag
gaagtatattcagaatgtagaacattaactgactaagaccttttaacatg
gcattgact

procedure

awk '

/^>/ && /C$/{
  if (length(seq)>0) {processSeq(seq); seq="";} 
  line=NR; print $0; flag="do"; next;
  } 
  
/^>/ {line=NR; flag="ignore"}
  
NR>1 && NR==(line 1) && (flag=="do"){seq=seq $0; line=NR; next} 
 
function processSeq(s)
 {system("echo \"" s "\" |  tr ACGTacgt TGCAtgca | rev ");}
 
END { if (length(seq)>0) processSeq(seq);} 

' seq.fasta

output

>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagttgtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
>chranotherC
agtcaatgccatgttaaaaggtcttagtcagttaatgttctacattctgaatatacttcctgactttattgaatctctaaggtgctttttaaaaggtcttatagatattgattaacatggcgggtcagttaatgttctacattctgaatatacttcatt

Tested using GNU Awk 5.1.0 on a Raspberry Pi 400.

  • Related