Home > Net >  Remove duplicates in multifasta, where entries are paired
Remove duplicates in multifasta, where entries are paired

Time:02-03

Hi my input looks like:

>ref 
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>sample1 
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>ref 
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
>sample2 
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT

The entries in the fasta file are paired so that the ref is paired with the sample# below it.

I want to identify where the nt seqeunce for sample# and ref are identical, and remove them from the fasta (or put them into another fasta file of their own). The output would hopefully be a fasta file where the nt sequences for refs and sample# are different.

So far I have tried seqkit rmdup command, however, this doesn't treat the entries as if they are paired. How can I accomplish this, ideally with a bash command or other program.

CodePudding user response:

This awk script seems to solve your problem. If it does, please mark this as correct. If it does not, please post a comment.

Mac_3.2.57$cat fastaScrubber-v0.awk
{
  if(NR%4==2){
    ref=$0
  }else if(NR%4==3){
    samN=$0
  }else if(NR%4==0){
    sam=$0
    if(sam!=ref){
      printf(">ref\n%s\n%s\n%s\n",ref ,samN ,sam)
    }
  }
}
Mac_3.2.57$cat fasta0 | awk '{if(NR%4==2||NR%4==0){print}}' | uniq -c
   2 GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
   1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
   1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT
   2 GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAB
   1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAB
   1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAB
   1 GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAXC
   1 GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAC
   1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAXC
   1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAC
Mac_3.2.57$awk -f fastaScrubber-v0.awk fasta0
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
>sample2 
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAB
>sample4 
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAB
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAXC
>sample5 
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAC
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAXC
>sample6 
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAC
Mac_3.2.57$cat fasta0
>ref 
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>sample1 
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>ref 
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
>sample2 
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT
>ref 
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAB
>sample3 
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAB
>ref 
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAB
>sample4 
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAB
>ref 
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAXC
>sample5 
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAC
>ref 
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAXC
>sample6 
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAC
Mac_3.2.57$

CodePudding user response:

Here is one way to do it:
Make a list from every two records (4 lines), each line an element of the list. Then compare DNA sequences in each list. If they are not equal, print the ref and sample.

Note: This script assumes that your FASTA record is a two line record (as in your example), not wrapped into multi line FASTA. If you have wrapped FASTA record, you need to convert it to two line records, or I suggest you parse your FASTA file with Biopython Module.

counter =0 
my_records =[]
with open("input.fasta") as f:
    for line in f:
        counter =1
        my_records.append(line.strip())

        if counter % 4 == 0:
            if my_records[1] != my_records[3]:
                [print(item) for item in my_records]
            my_records =[]
  • Related