I have a file that contains HTML code. The HTML tags are encoded like the following content:
\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e
The decoded HTML should be:
<div data-name="region-name" class="main-id">UK</div>
In Ruby, I used the cgi library's unescapeHTML, but it does not work: when it reads the content, it does not recognize the encoded tags. Here is another example:
require 'cgi'
single_quoted_string = '\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e'
double_quoted_string = "\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e"
puts 'unescape single_quoted_string ' + CGI.unescapeHTML(single_quoted_string)
puts 'unescape double_quoted_string ' + CGI.unescapeHTML(double_quoted_string)
The output of the previous code is:
unescape single_quoted_string \x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e
unescape double_quoted_string <div data-name="region-name" class="main-id">UK</div>
My question is: how can I make the single_quoted_string act as if its content were double-quoted, so that the function understands the encoded tags?
Thanks
CodePudding user response:
Ruby's parser allows certain escape sequences in string literals.
The double-quoted string literal "\x3c" is recognized as containing a hexadecimal escape \xnn, which represents the single character < (0x3C in ASCII).
The single-quoted string literal '\x3c', however, is treated literally, i.e. it represents four characters: \, x, 3, and c.
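The difference can be checked directly in irb. The snippet below is only an illustration; the variable names are made up for this example:

```ruby
# Same source text, two kinds of string literal.
double_quoted = "\x3c"   # the parser turns the escape into one character
single_quoted = '\x3c'   # no escape processing: four separate characters

p double_quoted.length   # 1
p single_quoted.length   # 4
p single_quoted.chars    # ["\\", "x", "3", "c"]
p double_quoted == '<'   # true
```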
"how can I make the single_quoted_string act as if its content is double-quoted"
You can't. In order to turn these four characters into < you have to parse the string yourself:
str = '\x3c'
str[2, 2]         #=> "3c"  (take the hex part)
str[2, 2].hex     #=> 60    (convert to a number)
str[2, 2].hex.chr #=> "<"   (convert to a character)
You can apply this with gsub:
str = '\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e'
str.gsub(/\\x\h{2}/) { |m| m[2, 2].hex.chr }
#=> "<div data-name=\"region-name\" class=\"main-id\">UK</div>"
/\\x\h{2}/ matches a literal backslash (\\) followed by x and two ({2}) hex characters (\h).
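As a quick check of what the pattern does and does not match, here is a small sketch. The constant name and the sample string are invented for illustration; only well-formed \xnn escapes are replaced, and anything else is left untouched:

```ruby
HEX_ESCAPE = /\\x\h{2}/  # same pattern as above

# One valid escape, one with non-hex digits, one with a single digit.
sample = '\x3c \xzz \x3'
result = sample.gsub(HEX_ESCAPE) { |m| m[2, 2].hex.chr }

puts result  # prints: < \xzz \x3
```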
Just for reference, a CGI encoded string would look like this:
str = "<div data-name=\"region-name\" class=\"main-id\">UK</div>"
CGI.escapeHTML(str)
#=> "&lt;div data-name=&quot;region-name&quot; class=&quot;main-id&quot;&gt;UK&lt;/div&gt;"
It uses &...; style character references.
CodePudding user response:
Your problem has nothing to do with HTML. \x3c represents the character with hex code 3c in the ASCII table.
Double-quoted strings look for these patterns and convert them to the corresponding character; single-quoted strings treat them as the final, literal text.
You can check for yourself that CGI is not doing anything.
CGI.unescapeHTML(double_quoted_string) == double_quoted_string
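To make the contrast concrete, here is a minimal sketch with shortened sample strings of my own (not the ones from the question): CGI.unescapeHTML changes entity-encoded text and leaves \xnn text alone.

```ruby
require 'cgi'

entity_encoded = '&lt;div&gt;UK&lt;/div&gt;'   # HTML character references
hex_escaped    = '\x3cdiv\x3eUK\x3c/div\x3e'   # literal backslash-x sequences

from_entities = CGI.unescapeHTML(entity_encoded)
from_hex      = CGI.unescapeHTML(hex_escaped)

p from_entities            # "<div>UK</div>"
p from_hex == hex_escaped  # true, nothing was decoded
```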
The easiest way I know to solve your problem is through gsub:
def convert(str)
  # \h\h matches exactly two hex digits (\w\w would also accept non-hex characters)
  str.gsub(/\\x(\h\h)/) do
    [Regexp.last_match(1)].pack("H*")
  end
end
single_quoted_string = '\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e'
puts convert(single_quoted_string)
What convert does is take the two hex digits of each escape and pack them into the corresponding character.
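Since the question mentions that the content comes from a file, note that text read from a file behaves like the single-quoted case: the backslashes are ordinary characters. Below is a rough sketch of that situation. The helper name decode_hex_escapes and the temporary file are assumptions made for the example, and it assumes the escapes are all in the ASCII range (below \x80):

```ruby
require 'tempfile'

# Hypothetical helper: replaces each literal \xnn with its character.
def decode_hex_escapes(text)
  text.gsub(/\\x(\h\h)/) { Regexp.last_match(1).hex.chr }
end

# Stand-in for the real file: write the encoded text, then read it back.
decoded = Tempfile.create('encoded') do |f|
  f.write('\x3cdiv class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e')
  f.flush
  raw = File.read(f.path)   # still contains literal backslashes
  decode_hex_escapes(raw)
end

puts decoded  # prints: <div class="main-id">UK</div>
```

If the file also contained &...; style references, CGI.unescapeHTML could be applied to the result as a second step.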