Home > Mobile >  Strange JSON-like data format in CSV
Strange JSON-like data format in CSV

Time:03-18

I have to work with a strange JSON-like format that crashes parsers because it is not quite JSON (no quotes, equals sign instead of colon etc).

Has anyone seen a data format like this and if so what is it?

"[{location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=16},..."

It is nested inside a CSV structure so I am wondering if it is related to that.

Edit:

Full example

[{location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=16}, {location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=290}, {location_type=1, location_fullname=Indonesia, location_countrycode=ID, location_adm1code=ID, location_adm2code=, location_latitude=-5, location_longitude=120, location_featureid=ID, character_offset=676}, {location_type=1, location_fullname=North Korea, location_countrycode=KN, location_adm1code=KN, location_adm2code=, location_latitude=40, location_longitude=127, location_featureid=KN, character_offset=748}, {location_type=1, location_fullname=British Indian Ocean Territory, location_countrycode=IO, location_adm1code=IO, location_adm2code=, location_latitude=-6, location_longitude=71.5, location_featureid=IO, character_offset=892}]

Running this through as yaml with python gives the following:

import yaml
dct = yaml.safe_load(body)
dct
[{'location_type=1': None,
  'location_fullname=Papua New Guinea': None,
  'location_countrycode=PP': None,
...

CodePudding user response:

I would process what yaml is giving you since it is almost there:

import yaml 

ex='''
[{location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=16}, {location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=290}, {location_type=1, location_fullname=Indonesia, location_countrycode=ID, location_adm1code=ID, location_adm2code=, location_latitude=-5, location_longitude=120, location_featureid=ID, character_offset=676}, {location_type=1, location_fullname=North Korea, location_countrycode=KN, location_adm1code=KN, location_adm2code=, location_latitude=40, location_longitude=127, location_featureid=KN, character_offset=748}, {location_type=1, location_fullname=British Indian Ocean Territory, location_countrycode=IO, location_adm1code=IO, location_adm2code=, location_latitude=-6, location_longitude=71.5, location_featureid=IO, character_offset=892}]'''

def conv(s):
    try:
        return int(s)
    except ValueError:
        pass 
        
    try:
        return float(s)
    except ValueError:
        return s
        

res=[{x:conv(y) for x,y in map(lambda s: s.split('='), di)} 
       for di in yaml.load(ex, Loader=yaml.CLoader)]

>>> res
[{'location_type': 1, 'location_fullname': 'Papua New Guinea', 'location_countrycode': 'PP', 'location_adm1code': 'PP', 'location_adm2code': '', 'location_latitude': -6, 'location_longitude': 147, 'location_featureid': 'PP', 'character_offset': 16}, {'location_type': 1, 'location_fullname': 'Papua New Guinea', 'location_countrycode': 'PP', 'location_adm1code': 'PP', 'location_adm2code': '', 'location_latitude': -6, 'location_longitude': 147, 'location_featureid': 'PP', 'character_offset': 290}, {'location_type': 1, 'location_fullname': 'Indonesia', 'location_countrycode': 'ID', 'location_adm1code': 'ID', 'location_adm2code': '', 'location_latitude': -5, 'location_longitude': 120, 'location_featureid': 'ID', 'character_offset': 676}, {'location_type': 1, 'location_fullname': 'North Korea', 'location_countrycode': 'KN', 'location_adm1code': 'KN', 'location_adm2code': '', 'location_latitude': 40, 'location_longitude': 127, 'location_featureid': 'KN', 'character_offset': 748}, {'location_type': 1, 'location_fullname': 'British Indian Ocean Territory', 'location_countrycode': 'IO', 'location_adm1code': 'IO', 'location_adm2code': '', 'location_latitude': -6, 'location_longitude': 71.5, 'location_featureid': 'IO', 'character_offset': 892}]

CodePudding user response:

The string format seems simple enough and compatible enough for converting it to YAML or JSON

YAML convertion
import yaml

body = '[{location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=16}, {location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=290}, {location_type=1, location_fullname=Indonesia, location_countrycode=ID, location_adm1code=ID, location_adm2code=, location_latitude=-5, location_longitude=120, location_featureid=ID, character_offset=676}, {location_type=1, location_fullname=North Korea, location_countrycode=KN, location_adm1code=KN, location_adm2code=, location_latitude=40, location_longitude=127, location_featureid=KN, character_offset=748}, {location_type=1, location_fullname=British Indian Ocean Territory, location_countrycode=IO, location_adm1code=IO, location_adm2code=, location_latitude=-6, location_longitude=71.5, location_featureid=IO, character_offset=892}]'

dct = yaml.safe_load(body.replace('=',': '))
# dct
[
  {
    'location_type': 1,
    'location_fullname': 'Papua New Guinea',
    'location_countrycode': 'PP',
    'location_adm1code': 'PP',
    'location_adm2code': None,
    'location_latitude': -6,
    'location_longitude': 147,
    'location_featureid': 'PP',
    'character_offset': 16
  },
  {
    'location_type': 1,
    'location_fullname': 'Papua New Guinea',
    'location_countrycode': 'PP',
    'location_adm1code': 'PP',
    'location_adm2code': None,
    'location_latitude': -6,
    'location_longitude': 147,
    'location_featureid': 'PP',
    'character_offset': 290
  },
  {
    'location_type': 1,
    'location_fullname': 'Indonesia',
    'location_countrycode': 'ID',
    'location_adm1code': 'ID',
    'location_adm2code': None,
    'location_latitude': -5,
    'location_longitude': 120,
    'location_featureid': 'ID',
    'character_offset': 676
  },
  {
    'location_type': 1,
    'location_fullname': 'North Korea',
    'location_countrycode': 'KN',
    'location_adm1code': 'KN',
    'location_adm2code': None,
    'location_latitude': 40,
    'location_longitude': 127,
    'location_featureid': 'KN',
    'character_offset': 748
  },
  {
    'location_type': 1,
    'location_fullname': 'British Indian Ocean Territory',
    'location_countrycode': 'IO',
    'location_adm1code': 'IO',
    'location_adm2code': None,
    'location_latitude': -6,
    'location_longitude': 71.5,
    'location_featureid': 'IO',
    'character_offset': 892
  }
]

JSON convertion:

note: This one will need further conversion of the data types, like in @dawg answer

import re
import json

body = '[{location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=16}, {location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=290}, {location_type=1, location_fullname=Indonesia, location_countrycode=ID, location_adm1code=ID, location_adm2code=, location_latitude=-5, location_longitude=120, location_featureid=ID, character_offset=676}, {location_type=1, location_fullname=North Korea, location_countrycode=KN, location_adm1code=KN, location_adm2code=, location_latitude=40, location_longitude=127, location_featureid=KN, character_offset=748}, {location_type=1, location_fullname=British Indian Ocean Territory, location_countrycode=IO, location_adm1code=IO, location_adm2code=, location_latitude=-6, location_longitude=71.5, location_featureid=IO, character_offset=892}]'

dct = json.loads(re.sub('([^\s[{^=] )=([^,}]*)([,}\]])', '"\\1":"\\2"\\3', body))
# dct
[
  {
    'location_type': '1',
    'location_fullname': 'Papua New Guinea',
    'location_countrycode': 'PP',
    'location_adm1code': 'PP',
    'location_adm2code': '',
    'location_latitude': '-6',
    'location_longitude': '147',
    'location_featureid': 'PP',
    'character_offset': '16'
  },
  {
    'location_type': '1',
    'location_fullname': 'Papua New Guinea',
    'location_countrycode': 'PP',
    'location_adm1code': 'PP',
    'location_adm2code': '',
    'location_latitude': '-6',
    'location_longitude': '147',
    'location_featureid': 'PP',
    'character_offset': '290'
  },
  {
    'location_type': '1',
    'location_fullname': 'Indonesia',
    'location_countrycode': 'ID',
    'location_adm1code': 'ID',
    'location_adm2code': '',
    'location_latitude': '-5',
    'location_longitude': '120',
    'location_featureid': 'ID',
    'character_offset': '676'
  },
  {
    'location_type': '1',
    'location_fullname': 'North Korea',
    'location_countrycode': 'KN',
    'location_adm1code': 'KN',
    'location_adm2code': '',
    'location_latitude': '40',
    'location_longitude': '127',
    'location_featureid': 'KN',
    'character_offset': '748'
  },
  {
    'location_type': '1',
    'location_fullname': 'British Indian Ocean Territory',
    'location_countrycode': 'IO',
    'location_adm1code': 'IO',
    'location_adm2code': '',
    'location_latitude': '-6',
    'location_longitude': '71.5',
    'location_featureid': 'IO',
    'character_offset': '892'
  }
]
  • Related