How to search and replace using Regex s/ when having to convert genome quality to ASCII?

Time:11-01

I am struggling to convert genome read qualities from a fasta.qual file (40, 39, 38, etc.) to ASCII using Phred+33 in Perl, but I can't get it to work. I am trying to do it with the s///g operator. My qualities are stored in a hash, and I am running the following loop:

foreach $key (keys %qual) {
  $value = $qual{$key};
  $qual{$key} =~ s/($value)/$map{$1}/g;
}

%map contains:

%map = ("0 " => "\!",
"1 " => "\"",
"2 " => "\#",
"3 " => "\$",
"4 " => "\%",
"5 " => "\&",
"6 " => "\'",
"7 " => "\(",
"8 " => "\)",
"9 " => "\*",
"10 " => "\+",
"11 " => "\,",
"12 " => "\-",
"13 " => "\.",
"14 " => "\/",
"15 " => "0",
"16 " => "1",
"17 " => "2",
"18 " => "3",
"19 " => "4",
"20 " => "5",
"21 " => "6",
"22 " => "7",
"23 " => "8",
"24 " => "9",
"25 " => "\:",
"26 " => "\;",
"27 " => "\<",
"28 " => "\=",
"29 " => "\>",
"30 " => "\?",
"31 " => "\@",
"32 " => "A",
"33 " => "B",
"34 " => "C",
"35 " => "D",
"36 " => "E",
"37 " => "F",
"38 " => "G",
"39 " => "H",
"40 " => "I",);

However, it transforms this:

>FR5ON5F01DQM9C 

37 37 37 37 37 37 40 40 40 40 40 40 40 40 40 40 40 35 35 35 40 40 40 40 40 40 40 40 40 40
40 40 40 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 36 36 30 30 30 30
30 38 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 

into this:

>FR5ON5F01DQM9C 



It happens for all the elements inside the hash. Is there anything I am doing wrong while applying the s/// operator?

The goal is to convert everything into a .fastq file.

Thank you!

CodePudding user response:

my $alt = join "|", map quotemeta, sort { length($b) <=> length($a) } keys %map;
my $re = qr/($alt)/;

$str =~ s/$re/$map{$1}/g;

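In retrospect, that doesn't take care of the spaces between the scores. It would make more sense to read in the whole quality string, then split it on whitespace and look up each score (this needs the keys of %map to have no trailing spaces):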

$seq = join "", map $map{$_}, split ' ', $seq;
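
As for why the original loop empties every record: the pattern is built from the entire value, so $1 is the whole quality string, which is not a key in %map, and the match is replaced with nothing. Below is a minimal sketch of the split/join idea that avoids the lookup table altogether by computing chr(score + 33) directly. The helper name phred33_string and the shortened sample record are made up for illustration; they are not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a whitespace-separated list of Phred scores
# into a Phred+33 quality string. split ' ' splits on any run of
# whitespace (spaces and newlines) and ignores leading whitespace.
sub phred33_string {
    my ($scores) = @_;
    return join "", map { chr($_ + 33) } split ' ', $scores;
}

# Illustrative record: a few scores taken from the question, spread over
# two lines the way a .qual file wraps them.
my %qual = (
    "FR5ON5F01DQM9C" => "37 37 40 40 35\n36 30 38",
);

for my $key (sort keys %qual) {
    # Prints the header line followed by the encoded string FFIIDE?G
    print ">$key\n", phred33_string($qual{$key}), "\n";
}
```

Computing the character from the score keeps the table out of the code entirely, so a typo in one entry cannot silently corrupt the output. A real FASTQ record would also need the sequence line and the "+" separator line, which this sketch leaves out.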