I have to work with a strange JSON-like format that crashes parsers because it is not quite JSON (no quotes, equals sign instead of colon etc).
Has anyone seen a data format like this and if so what is it?
"[{location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=16},..."
It is nested inside a CSV structure so I am wondering if it is related to that.
Edit:
Full example
[{location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=16}, {location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=290}, {location_type=1, location_fullname=Indonesia, location_countrycode=ID, location_adm1code=ID, location_adm2code=, location_latitude=-5, location_longitude=120, location_featureid=ID, character_offset=676}, {location_type=1, location_fullname=North Korea, location_countrycode=KN, location_adm1code=KN, location_adm2code=, location_latitude=40, location_longitude=127, location_featureid=KN, character_offset=748}, {location_type=1, location_fullname=British Indian Ocean Territory, location_countrycode=IO, location_adm1code=IO, location_adm2code=, location_latitude=-6, location_longitude=71.5, location_featureid=IO, character_offset=892}]
Running this through as yaml with python gives the following:
import yaml
dct = yaml.safe_load(body)
dct
[{'location_type=1': None,
'location_fullname=Papua New Guinea': None,
'location_countrycode=PP': None,
...
CodePudding user response:
I would process what yaml
is giving you since it is almost there:
import yaml
ex='''
[{location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=16}, {location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=290}, {location_type=1, location_fullname=Indonesia, location_countrycode=ID, location_adm1code=ID, location_adm2code=, location_latitude=-5, location_longitude=120, location_featureid=ID, character_offset=676}, {location_type=1, location_fullname=North Korea, location_countrycode=KN, location_adm1code=KN, location_adm2code=, location_latitude=40, location_longitude=127, location_featureid=KN, character_offset=748}, {location_type=1, location_fullname=British Indian Ocean Territory, location_countrycode=IO, location_adm1code=IO, location_adm2code=, location_latitude=-6, location_longitude=71.5, location_featureid=IO, character_offset=892}]'''
def conv(s):
try:
return int(s)
except ValueError:
pass
try:
return float(s)
except ValueError:
return s
res=[{x:conv(y) for x,y in map(lambda s: s.split('='), di)}
for di in yaml.load(ex, Loader=yaml.CLoader)]
>>> res
[{'location_type': 1, 'location_fullname': 'Papua New Guinea', 'location_countrycode': 'PP', 'location_adm1code': 'PP', 'location_adm2code': '', 'location_latitude': -6, 'location_longitude': 147, 'location_featureid': 'PP', 'character_offset': 16}, {'location_type': 1, 'location_fullname': 'Papua New Guinea', 'location_countrycode': 'PP', 'location_adm1code': 'PP', 'location_adm2code': '', 'location_latitude': -6, 'location_longitude': 147, 'location_featureid': 'PP', 'character_offset': 290}, {'location_type': 1, 'location_fullname': 'Indonesia', 'location_countrycode': 'ID', 'location_adm1code': 'ID', 'location_adm2code': '', 'location_latitude': -5, 'location_longitude': 120, 'location_featureid': 'ID', 'character_offset': 676}, {'location_type': 1, 'location_fullname': 'North Korea', 'location_countrycode': 'KN', 'location_adm1code': 'KN', 'location_adm2code': '', 'location_latitude': 40, 'location_longitude': 127, 'location_featureid': 'KN', 'character_offset': 748}, {'location_type': 1, 'location_fullname': 'British Indian Ocean Territory', 'location_countrycode': 'IO', 'location_adm1code': 'IO', 'location_adm2code': '', 'location_latitude': -6, 'location_longitude': 71.5, 'location_featureid': 'IO', 'character_offset': 892}]
CodePudding user response:
The string format seems simple enough and compatible enough for converting it to YAML or JSON
YAML convertion
import yaml
body = '[{location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=16}, {location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=290}, {location_type=1, location_fullname=Indonesia, location_countrycode=ID, location_adm1code=ID, location_adm2code=, location_latitude=-5, location_longitude=120, location_featureid=ID, character_offset=676}, {location_type=1, location_fullname=North Korea, location_countrycode=KN, location_adm1code=KN, location_adm2code=, location_latitude=40, location_longitude=127, location_featureid=KN, character_offset=748}, {location_type=1, location_fullname=British Indian Ocean Territory, location_countrycode=IO, location_adm1code=IO, location_adm2code=, location_latitude=-6, location_longitude=71.5, location_featureid=IO, character_offset=892}]'
dct = yaml.safe_load(body.replace('=',': '))
# dct
[
{
'location_type': 1,
'location_fullname': 'Papua New Guinea',
'location_countrycode': 'PP',
'location_adm1code': 'PP',
'location_adm2code': None,
'location_latitude': -6,
'location_longitude': 147,
'location_featureid': 'PP',
'character_offset': 16
},
{
'location_type': 1,
'location_fullname': 'Papua New Guinea',
'location_countrycode': 'PP',
'location_adm1code': 'PP',
'location_adm2code': None,
'location_latitude': -6,
'location_longitude': 147,
'location_featureid': 'PP',
'character_offset': 290
},
{
'location_type': 1,
'location_fullname': 'Indonesia',
'location_countrycode': 'ID',
'location_adm1code': 'ID',
'location_adm2code': None,
'location_latitude': -5,
'location_longitude': 120,
'location_featureid': 'ID',
'character_offset': 676
},
{
'location_type': 1,
'location_fullname': 'North Korea',
'location_countrycode': 'KN',
'location_adm1code': 'KN',
'location_adm2code': None,
'location_latitude': 40,
'location_longitude': 127,
'location_featureid': 'KN',
'character_offset': 748
},
{
'location_type': 1,
'location_fullname': 'British Indian Ocean Territory',
'location_countrycode': 'IO',
'location_adm1code': 'IO',
'location_adm2code': None,
'location_latitude': -6,
'location_longitude': 71.5,
'location_featureid': 'IO',
'character_offset': 892
}
]
JSON convertion:
note: This one will need further conversion of the data types, like in @dawg answer
import re
import json
body = '[{location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=16}, {location_type=1, location_fullname=Papua New Guinea, location_countrycode=PP, location_adm1code=PP, location_adm2code=, location_latitude=-6, location_longitude=147, location_featureid=PP, character_offset=290}, {location_type=1, location_fullname=Indonesia, location_countrycode=ID, location_adm1code=ID, location_adm2code=, location_latitude=-5, location_longitude=120, location_featureid=ID, character_offset=676}, {location_type=1, location_fullname=North Korea, location_countrycode=KN, location_adm1code=KN, location_adm2code=, location_latitude=40, location_longitude=127, location_featureid=KN, character_offset=748}, {location_type=1, location_fullname=British Indian Ocean Territory, location_countrycode=IO, location_adm1code=IO, location_adm2code=, location_latitude=-6, location_longitude=71.5, location_featureid=IO, character_offset=892}]'
dct = json.loads(re.sub('([^\s[{^=] )=([^,}]*)([,}\]])', '"\\1":"\\2"\\3', body))
# dct
[
{
'location_type': '1',
'location_fullname': 'Papua New Guinea',
'location_countrycode': 'PP',
'location_adm1code': 'PP',
'location_adm2code': '',
'location_latitude': '-6',
'location_longitude': '147',
'location_featureid': 'PP',
'character_offset': '16'
},
{
'location_type': '1',
'location_fullname': 'Papua New Guinea',
'location_countrycode': 'PP',
'location_adm1code': 'PP',
'location_adm2code': '',
'location_latitude': '-6',
'location_longitude': '147',
'location_featureid': 'PP',
'character_offset': '290'
},
{
'location_type': '1',
'location_fullname': 'Indonesia',
'location_countrycode': 'ID',
'location_adm1code': 'ID',
'location_adm2code': '',
'location_latitude': '-5',
'location_longitude': '120',
'location_featureid': 'ID',
'character_offset': '676'
},
{
'location_type': '1',
'location_fullname': 'North Korea',
'location_countrycode': 'KN',
'location_adm1code': 'KN',
'location_adm2code': '',
'location_latitude': '40',
'location_longitude': '127',
'location_featureid': 'KN',
'character_offset': '748'
},
{
'location_type': '1',
'location_fullname': 'British Indian Ocean Territory',
'location_countrycode': 'IO',
'location_adm1code': 'IO',
'location_adm2code': '',
'location_latitude': '-6',
'location_longitude': '71.5',
'location_featureid': 'IO',
'character_offset': '892'
}
]