How to read json encoded in ibm437 in Ruby

Time:03-20

I have a json file that has the following data in it:

{"help":true}

The platform is Windows 2016. When I open the file in Notepad, the encoding shows as UCS-2 LE BOM, but when I use Ruby to display the encoding, it is IBM437. When I try to parse the JSON, it fails with the following error:

ruby/2.5.0/json/common.rb:156:in `parse': 765: unexpected token at ' ■{' (JSON::ParserError)

My code is as follows:

require 'json'
def current_options
    dest='C:/test.json'
    file = File.read(dest)
    if(File.exist?(dest)) 
      p file.encoding
      p file
      @data_hash ||= JSON.parse(file)
      return @data_hash
    else
      return {}
    end
end

p current_options

And the output looks like this:

PS C:\> & "C:\ruby\bin\ruby.exe" .\ruby.rb #this is the file that contains my above code
#<Encoding:IBM437>
"\xFF\xFE{\x00\"\x00h\x00e\x00l\x00p\x00\"\x00:\x00t\x00r\x00u\x00e\x00}\x00"
Traceback (most recent call last):
        3: from ./ruby.rb:20:in `<main>'
        2: from ./ruby.rb:13:in `current_options'
        1: from C:/ruby/lib/ruby/2.5.0/json/common.rb:156:in `parse'
C:/ruby/lib/ruby/2.5.0/json/common.rb:156:in `parse': 765: unexpected token at ' ■{' (JSON::ParserError)

If I use Notepad to change the encoding from UCS-2 LE BOM to UTF-8 and then parse the file with my code, it works without issues. The problem is that another application manages this file and always creates it in that encoding.

PS C:\> & "C:\ruby\bin\ruby.exe" .\ruby.rb #this is the file that contains my above code
#<Encoding:IBM437>
"{\"help\":true}"
{"help"=>true}

I tried specifying the encoding and forcing it to use utf-8 but it still fails:

require 'json'
def current_options
    dest='C:/test.json'
    file = File.read(dest,:external_encoding => 'ibm437',:internal_encoding => 'utf-8')
    if(File.exist?(dest)) 
      p file.encoding
      p file
      @data_hash ||= JSON.parse(file)
      return @data_hash
    else
      return {}
    end
end

p current_options

This outputs the following:

PS C:\> & "C:\ruby\bin\ruby.exe" .\ruby.rb #this is the file that contains my above code
#<Encoding:UTF-8>
"\u00A0\u25A0{\u0000\"\u0000h\u0000e\u0000l\u0000p\u0000\"\u0000:\u0000t\u0000r\u0000u\u0000e\u0000}\u0000"
Traceback (most recent call last):
        3: from ./ruby.rb:20:in `<main>'
        2: from ./ruby.rb:13:in `current_options'
        1: from C:/ruby/lib/ruby/2.5.0/json/common.rb:156:in `parse'
C:/ruby/lib/ruby/2.5.0/json/common.rb:156:in `parse': 765: unexpected token at ' ■{' (JSON::ParserError)

I am not sure how I can parse this file, any suggestions? Thank you,

CodePudding user response:

\u00A0 is a non-breaking space. \u25A0 is a black square. \u0000 is a null byte. These are not valid JSON characters. You'll have to strip or convert them.

It's probable that Ruby got the encoding wrong and that your file is not really IBM437 but UCS-2 LE.
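
As a rough illustration of where those characters come from, here is a small sketch. It assumes the file begins with the byte order mark FF FE, as the first output in the question shows, and simply decodes the same bytes two ways:

```ruby
# First four bytes of the file from the question: the BOM (FF FE) and "{" in UTF-16LE.
raw = "\xFF\xFE{\x00".b

# Misinterpret the bytes as IBM437 (code page 437) and transcode to UTF-8.
misread = raw.dup.force_encoding('IBM437').encode('UTF-8')

# Interpret the same bytes as UTF-16LE, which is what they really are.
correct = raw.dup.force_encoding('UTF-16LE').encode('UTF-8')

p misread.codepoints.map { |c| format('U+%04X', c) }
p correct.codepoints.map { |c| format('U+%04X', c) }
```

The first line lists the non-breaking space, the black square, the brace and a null character; the second lists only the BOM character and the brace.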

CodePudding user response:

Your file really is in UCS2-LE with a BOM, so Notepad is telling you the truth.

Ruby does not attempt to figure out the encoding, as far as I know. When you do this:

file = File.read(dest)
if(File.exist?(dest)) 
    p file.encoding

What you see is not the encoding Ruby has deduced from the contents of the file. Rather, it is the OS default locale encoding. On US OEM installs of Windows, the default encoding is IBM437, which is the original DOS encoding. The actual encoding of the file is irrelevant.
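
A quick way to check this is sketched below. It assumes Encoding.default_internal is nil (the usual default) and uses a temporary file in place of C:/test.json:

```ruby
require 'tempfile'

# Write UTF-16LE bytes (BOM followed by "{}") to a temporary file.
tmp = Tempfile.new('encoding-demo')
tmp.binmode
tmp.write("\xFF\xFE{\x00}\x00".b)
tmp.close

# With no encoding given, File.read only tags the bytes with the default
# external encoding; it does not inspect the contents or the BOM.
text = File.read(tmp.path)
p Encoding.default_external
p text.encoding
p text.encoding == Encoding.default_external

tmp.unlink
```

On the machine from the question the first two lines would both show IBM437; elsewhere they show whatever the locale default is, typically UTF-8.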

You should be able to convert the file to UTF-8 by supplying :external_encoding => 'utf-16' (instead of 'ibm437') together with :internal_encoding => 'utf-8', since the BOM provides the endian information.
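
If passing encoding options to File.read proves awkward, a more explicit variant is to read the raw bytes and transcode them yourself. The following is only a sketch under the assumption that the other application always writes UTF-16LE; the helper name read_utf16le_json is made up for illustration, and a temporary file stands in for C:/test.json:

```ruby
require 'json'
require 'tempfile'

# Hypothetical helper: read a UTF-16LE file (with or without a BOM) and parse it as JSON.
def read_utf16le_json(path)
  raw  = File.binread(path)                              # raw bytes, no transcoding
  text = raw.force_encoding('UTF-16LE').encode('UTF-8')  # decode as UTF-16LE
  JSON.parse(text.delete_prefix("\uFEFF"))               # drop the BOM if present
end

# Demo: create a file the way the other application would (UTF-16LE with BOM).
tmp = Tempfile.new(['test', '.json'])
tmp.binmode
tmp.write("\uFEFF{\"help\":true}".encode('UTF-16LE'))
tmp.close

p read_utf16le_json(tmp.path)
tmp.unlink
```

String#delete_prefix needs Ruby 2.5 or later, which matches the version in the traceback; on older versions, sub(/\A\uFEFF/, '') does the same job.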
