If I run an Athena query in AWS, the data I get back has structs with key/value pairs that look like this:
{
"events": "[{deviceType=Android,logins=400},{deviceType=iPhone,logins=550}]"
}
I can use regular expressions to parse this, but things like special characters make that de-serialization very error-prone.
For example, {deviceType=Android, date=2022-01-01} will run into issues with delimiters if I use regex.
Is there an existing de-serializer for this type of thing?
EDIT:
This is the de-serialize regex I have:
import json
import re

def deserialize(s):
    # Surround any word with "
    s1 = re.sub(r'(\w+)', r'"\g<1>"', s)
    # Replace = with :
    s2 = re.sub('=', ':', s1)
    return json.loads(s2)
This hits issues when there are special characters in the value, like "-" or ".". The regex can't tell where a "word" ends, so it doesn't place the enclosing quotes correctly.
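To make the failure concrete, here is a minimal reproduction. It assumes the pattern is meant to be \w+ (one or more word characters), which is a guess on my part. The date is split at every "-", so the quoted string is no longer valid JSON.

```python
import json
import re

def deserialize(s):
    # Wrap every run of word characters in double quotes
    s1 = re.sub(r'(\w+)', r'"\g<1>"', s)
    # Replace = with :
    s2 = s1.replace('=', ':')
    return json.loads(s2)

sample = '{deviceType=Android, date=2022-01-01}'

# Show what the quoting step alone produces
quoted = re.sub(r'(\w+)', r'"\g<1>"', sample)
print(quoted)  # {"deviceType"="Android", "date"="2022"-"01"-"01"}

try:
    deserialize(sample)
    failed = False
except json.JSONDecodeError as exc:
    # json.loads rejects the string because of the stray "-" between quoted parts
    failed = True
    print('not valid JSON:', exc)
```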
CodePudding user response:
Given the data as shown, you can isolate the strings between curly brackets with a regular expression, then split those strings into their component parts. Here's an example:
import re
d = {'events': "[{deviceType=Android,logins=400},{deviceType=iPhone,logins=550}]"}
# Non-greedy match of everything between each pair of curly brackets
for t in re.findall('(?<={).+?(?=})', d['events']):
    for p in t.split(','):
        print(p)
Output:
deviceType=Android
logins=400
deviceType=iPhone
logins=550
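If dictionaries are more useful than printed pairs, the same idea can be taken one step further. This is only a sketch: it assumes that values never contain a comma or a closing curly bracket, and the names records, record and pair are just illustrative. Every value stays a string.

```python
import re

d = {'events': "[{deviceType=Android,logins=400},{deviceType=iPhone,logins=550}]"}

records = []
# Non-greedy match of everything between each pair of curly brackets
for body in re.findall(r'(?<=\{).+?(?=\})', d['events']):
    record = {}
    for pair in body.split(','):
        # Split on the first '=' only, so a value may itself contain '='
        key, _, value = pair.partition('=')
        record[key.strip()] = value.strip()
    records.append(record)

print(records)
# [{'deviceType': 'Android', 'logins': '400'}, {'deviceType': 'iPhone', 'logins': '550'}]
```

Because each pair is split on "=" rather than quoted word by word, characters such as "-" and "." in a value (2022-01-01, for example) pass through untouched.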
CodePudding user response:
The data inside the quotes is almost JSON but it's missing the quotes around keys and values. With a few judiciously chained .replace() method calls, you should be able to convert it from almost-JSON to JSON and then deserialize it using the json module:
import json
obj = {"events": "[{deviceType=Android, date=2022-01-01}]"}
events = obj['events']
events_json = events.replace(', ', ',').replace('{', '{"').replace('}', '"}').replace('=', '":"').replace(',', '","').replace('}","{','},{')
parsed = json.loads(events_json)
print(parsed[0])
print(parsed[0]['deviceType']) # prints 'Android'
print(parsed[0]['date']) # prints '2022-01-01'
*Edit to fix an issue raised by MisterMiyagi.
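For reuse, the same chain can be wrapped in a small helper and checked against the multi-object sample from the question. The function name parse_athena_structs is made up for this sketch. It assumes that no key or value contains ",", "=", "{" or "}", since any of those would be rewritten by the replacements, and every value comes back as a string.

```python
import json

def parse_athena_structs(text):
    """Turn '[{k=v, k2=v2},{...}]' into a list of dicts (all values are strings)."""
    as_json = (
        text.replace(', ', ',')        # normalise the separator
            .replace('{', '{"')        # open the quote on the first key
            .replace('}', '"}')        # close the quote on the last value
            .replace('=', '":"')       # quote around each key/value boundary
            .replace(',', '","')       # quote around each pair boundary
            .replace('}","{', '},{')   # undo the quoting between objects
    )
    return json.loads(as_json)

single = parse_athena_structs("[{deviceType=Android, date=2022-01-01}]")
multi = parse_athena_structs(
    "[{deviceType=Android,logins=400},{deviceType=iPhone,logins=550}]"
)

print(single[0]['date'])
print(multi[1])
```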