How to make a single-quoted string act like a double-quoted string in Ruby?

I have a file containing HTML code in which the tags are encoded like this:

\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e

The decoded HTML should be:

<div data-name="region-name" class="main-id">UK</div>

In Ruby, I used the cgi library's unescapeHTML, but it does not work: when it reads the content, it does not recognize the encoded tags. Here is an example:

require 'cgi'

single_quoted_string = '\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e'
double_quoted_string = "\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e"


puts 'unescape single_quoted_string ' + CGI.unescapeHTML(single_quoted_string)
puts 'unescape double_quoted_string ' + CGI.unescapeHTML(double_quoted_string)

The output of the previous code is:

unescape single_quoted_string \x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e
unescape double_quoted_string <div data-name="region-name" class="main-id">UK</div>

My question is, how can I make the single_quoted_string act as if its content is double-quoted to make the function understand the encoded tags?

Thanks

CodePudding user response：

Ruby's parser allows certain escape sequences in string literals.

The double-quoted string literal "\x3c" is recognized as containing a hexadecimal escape \xnn, which represents the single character < (0x3C in ASCII).

The single-quoted string literal '\x3c' however is treated literally, i.e. it represents four characters: \, x, 3, and c.
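
A quick way to see the difference is to compare the two literals directly. This is only an illustrative check; the variable names are made up for the example:

```ruby
single = '\x3c'   # four characters: backslash, x, 3, c
double = "\x3c"   # one character: <

p single.length   # 4
p double.length   # 1
p single.chars    # ["\\", "x", "3", "c"]
p double == "<"   # true
```

Text read from a file behaves like the single-quoted case: the backslash sequences arrive as literal characters, because escape processing only happens when Ruby parses a string literal in source code.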

"how can I make the single_quoted_string act as if its content is double-quoted"

You can't. In order to turn these four characters into < you have to parse the string yourself:

str = '\x3c'

str[2, 2]         #=> "3c"  take hex part
str[2, 2].hex     #=> 60    convert to number
str[2, 2].hex.chr #=> "<"   convert to character

You can apply this to gsub:

str = '\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e'

str.gsub(/\\x\h{2}/) { |m| m[2, 2].hex.chr }
#=> "<div data-name=\"region-name\" class=\"main-id\">UK</div>"

/\\x\h{2}/ matches a literal backslash (\\) followed by x and two ({2}) hex characters (\h).
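
If the escaped data may contain bytes above 0x7F (for example, a UTF-8 character written as several \xNN escapes), decoding one byte at a time into a UTF-8 string can produce invalid or incompatible encodings. A minimal sketch of one way around that, assuming the decoded bytes are meant to be UTF-8; the helper name decode_hex_escapes is hypothetical and not part of any library:

```ruby
# Hypothetical helper: decodes literal \xNN sequences byte by byte.
def decode_hex_escapes(str)
  # Work on a binary copy so single bytes >= 0x80 can be inserted safely.
  bytes = str.b.gsub(/\\x(\h{2})/) { Regexp.last_match(1).hex.chr }
  # Assumption: the resulting byte sequence is UTF-8 text.
  bytes.force_encoding(Encoding::UTF_8)
end

decoded = decode_hex_escapes('\x3cp\x3ecaf\xc3\xa9\x3c/p\x3e')
puts decoded   # <p>café</p>
```

For pure ASCII input such as the string in the question, this gives the same result as the plain gsub call above.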

Just for reference, a CGI encoded string would look like this:

str = "<div data-name=\"region-name\" class=\"main-id\">UK</div>"

CGI.escapeHTML(str)
#=> "&lt;div data-name=&quot;region-name&quot; class=&quot;main-id&quot;&gt;UK&lt;/div&gt;"

It uses &...; style character references.
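
To make the contrast concrete, here is a small check of what CGI.unescapeHTML does with each kind of input. The sample strings are invented for the example:

```ruby
require 'cgi'

# Character references are decoded.
from_entities = CGI.unescapeHTML('&lt;div class=&quot;main-id&quot;&gt;UK&lt;/div&gt;')
p from_entities   # "<div class=\"main-id\">UK</div>"

# Literal \xNN text contains no character references, so it comes back unchanged.
from_hex = CGI.unescapeHTML('\x3cdiv\x3e')
p from_hex        # "\\x3cdiv\\x3e"
```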

CodePudding user response：

Your problem has nothing to do with HTML. \x3c is the hexadecimal code 3c, which is < in the ASCII table. Double-quoted strings look for these patterns and convert them to the corresponding characters; single-quoted strings keep them as literal text.

You can check for yourself that CGI is not doing anything.

CGI.unescapeHTML(double_quoted_string) == double_quoted_string
#=> true

The easiest way I know to solve your problem is with gsub:

def convert(str)
  # \h matches a hex digit, so only valid \xNN escapes are replaced
  str.gsub(/\\x(\h\h)/) do
    [Regexp.last_match(1)].pack("H*")
  end
end

single_quoted_string = '\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e'

puts convert(single_quoted_string)

convert finds every \x escape, takes the two hex digits that follow it, and packs them into the corresponding character.
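
As a small illustration of the pack step on its own, Array#pack with the "H*" directive turns a string of hex digits into the bytes they describe. The variable names are only for the example:

```ruby
one  = ['3c'].pack('H*')         # one byte, 0x3c
many = ['3c646976'].pack('H*')   # four bytes: 0x3c 0x64 0x69 0x76

p one    # "<"
p many   # "<div"
```

Note that pack returns a binary (ASCII-8BIT) string. That is harmless for ASCII characters like these, but worth keeping in mind if the escapes encode non-ASCII bytes.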