Hi my input looks like:
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>sample1
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
>sample2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT
The entries in the fasta file are paired so that the ref is paired with the sample# below it.
I want to identify where the nt seqeunce for sample# and ref are identical, and remove them from the fasta (or put them into another fasta file of their own). The output would hopefully be a fasta file where the nt sequences for refs and sample# are different.
So far I have tried seqkit rmdup command, however, this doesn't treat the entries as if they are paired. How can I accomplish this, ideally with a bash command or other program.
CodePudding user response:
This awk script seems to solve your problem. If it does, please mark this as correct. If it does not, please post a comment.
Mac_3.2.57$cat fastaScrubber-v0.awk
{
if(NR%4==2){
ref=$0
}else if(NR%4==3){
samN=$0
}else if(NR%4==0){
sam=$0
if(sam!=ref){
printf(">ref\n%s\n%s\n%s\n",ref ,samN ,sam)
}
}
}
Mac_3.2.57$cat fasta0 | awk '{if(NR%4==2||NR%4==0){print}}' | uniq -c
2 GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT
2 GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAB
1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAB
1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAB
1 GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAXC
1 GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAC
1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAXC
1 GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAC
Mac_3.2.57$awk -f fastaScrubber-v0.awk fasta0
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
>sample2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAB
>sample4
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAB
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAXC
>sample5
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAC
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAXC
>sample6
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAC
Mac_3.2.57$cat fasta0
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>sample1
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
>sample2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAB
>sample3
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAB
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAB
>sample4
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAB
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAXC
>sample5
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAC
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAXC
>sample6
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAC
Mac_3.2.57$
CodePudding user response:
Here is one way to do it:
Make a list from every two records (4 lines), each line an element of the list. Then compare DNA sequences in each list. If they are not equal, print the ref and sample.
Note: This script assumes that your FASTA record is a two line record (as in your example), not wrapped into multi line FASTA. If you have wrapped FASTA record, you need to convert it to two line records, or I suggest you parse your FASTA file with Biopython Module.
counter =0
my_records =[]
with open("input.fasta") as f:
for line in f:
counter =1
my_records.append(line.strip())
if counter % 4 == 0:
if my_records[1] != my_records[3]:
[print(item) for item in my_records]
my_records =[]