How to parse "{Key=Value}" structs into JSON with Python?

If I run an Athena query in AWS, the data I get back has structs with key/value pairs that look like this:

{
    "events": "[{deviceType=Android,logins=400},{deviceType=iPhone,logins=550}]"
}

I can use regular expressions to parse this, but things like special characters make that de-serialization very error-prone.

For example, {deviceType=Android, date=2022-01-01} will run into issues with delimiters if I use regex.

Is there an existing de-serializer for this type of thing?

EDIT:

This is the de-serialize regex I have:

import json
import re

def deserialize(s):
    # Surround any word with "
    s1 = re.sub(r'(\w+)', r'"\g<1>"', s)

    # Replace = with :
    s2 = re.sub('=', ':', s1)

    return json.loads(s2)

This breaks when a value contains special characters such as "-" or ".". The \w pattern does not match those characters, so the regex cannot tell where a "word" ends and puts the enclosing quotes in the wrong places.

CodePudding user response：

Given the data as shown, you can isolate the strings between curly brackets with a regular expression, then split those strings into their component parts. Here's an example:

import re

d = {'events': "[{deviceType=Android,logins=400},{deviceType=iPhone,logins=550}]"}

for t in re.findall(r'(?<={).+?(?=})', d['events']):
    for p in t.split(','):
        print(p)

Output:

deviceType=Android
logins=400
deviceType=iPhone
logins=550
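
To get from those printed pairs to actual JSON, each pair can be split on its first "=" and collected into a dict. The following is only a minimal sketch of that idea: parse_structs is a hypothetical helper name, and it assumes flat structs (no nesting) whose values contain no commas or curly brackets.

```python
import json
import re

def parse_structs(s):
    """Turn '[{k=v,k=v},{...}]' into a list of dicts (flat structs only)."""
    records = []
    # Grab the text between each pair of curly brackets (non-greedy).
    for body in re.findall(r'(?<={).+?(?=})', s):
        record = {}
        for pair in body.split(','):
            # Split on the first '=' only, so a value may itself contain '='.
            key, _, value = pair.partition('=')
            record[key.strip()] = value.strip()
        records.append(record)
    return records

d = {'events': "[{deviceType=Android,logins=400},{deviceType=iPhone,logins=550}]"}
events = parse_structs(d['events'])
print(json.dumps(events))
```

This prints [{"deviceType": "Android", "logins": "400"}, {"deviceType": "iPhone", "logins": "550"}]. Note that every value stays a string; converting "400" to a number would be a separate step.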

CodePudding user response：

The data inside the quotes is almost JSON but it's missing the quotes around keys and values. With a few judiciously chained .replace() method calls, you should be able to convert it from almost-JSON to JSON and then deserialize it using the json module:

import json
obj = {"events": "[{deviceType=Android, date=2022-01-01}]"}
events = obj['events']
events_json = events.replace(', ', ',').replace('{', '{"').replace('}', '"}').replace('=', '":"').replace(',', '","').replace('}","{','},{')
parsed = json.loads(events_json)
print(parsed[0])

print(parsed[0]['deviceType']) # prints 'Android'
print(parsed[0]['date']) # prints '2022-01-01'

*Edit to fix an issue raised by MisterMiyagi.
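
A variation on the same idea, closer to the regex in the question: instead of matching "words", quote every run of characters that is not one of the delimiters { } [ ] , = and then swap "=" for ":". This is a sketch rather than a full parser. The helper name _quote is made up for illustration, and the approach assumes that no key or value contains a delimiter character.

```python
import json
import re

def _quote(match):
    token = match.group().strip()
    # Leave whitespace-only runs (e.g. between "}," and "{") untouched.
    return json.dumps(token) if token else match.group()

def deserialize(s):
    # Quote each run of non-delimiter characters, so "-" and "." stay
    # inside the value instead of splitting it.
    quoted = re.sub(r'[^{}\[\],=]+', _quote, s)
    # Tokens cannot contain '=', so every remaining '=' is a separator.
    return json.loads(quoted.replace('=', ':'))

print(deserialize("[{deviceType=Android, date=2022-01-01}]"))
```

This prints [{'deviceType': 'Android', 'date': '2022-01-01'}]. Because the braces and brackets pass through unchanged, nested structs such as {a={b=1}} should also come out as nested dicts, but a value that contains a comma or "=" will still break it.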