Regular expression find and replace with quoted version of self

Time:07-25

I have some string data that I would like to convert to JSON. The strings use very non-standard delimiters and have hierarchical keys and values.

I can currently get close to converting the strings to a JSON-like format that can be read with json.loads() into a dict, but I am stuck on converting the "keys" to double-quoted versions of themselves.

Here is the string

test_string = """
APPLES12:10.000^5.1234V6.456V8.111V4.222V10.000V20.000V20.12347V25.000%5.000^10.1234V16.456V15.111V5.222V15.000V15.000V6.000V25.000_BANNAS34:5.000^4.123V4.123V4.123V4.123V4.123V4.123V4.123V4.123%4.800^5.123V4.123V5.123V6.123V4.123V6.123V7.123V4.123_GRAPES:10.00^3.125%5.00^4.345%3.00^10.111_PEARS:10.00^3.123%5.000^4.234%3.000^5.67
"""

And here is what I have so far:

# copy the string into a new one
new_string = test_string

# convert (almost) to json format to be read as a dict
replace_dict = {
    ':': ':[\n',
    'V': ',',
    '^': ':[',
    '%': '],\n',
    '_': ']],\n',
}

# convert separators to make it JSON-like
for k, v in replace_dict.items():
    new_string = new_string.replace(k, v)

# remove the trailing newline and add the closing brackets
new_string = new_string.strip('\n') + ']]'

print(new_string)

Which prints a more readable, close-to-JSON format:

APPLES12:[
10.000:[5.1234,6.456,8.111,4.222,10.000,20.000,20.12347,25.000],
5.000:[10.1234,16.456,15.111,5.222,15.000,15.000,6.000,25.000]],
BANNAS34:[
5.000:[4.123,4.123,4.123,4.123,4.123,4.123,4.123,4.123],
4.800:[5.123,4.123,5.123,6.123,4.123,6.123,7.123,4.123]],
GRAPES:[
10.00:[3.125],
5.00:[4.345],
3.00:[10.111]],
PEARS:[
10.00:[3.123],
5.000:[4.234],
3.000:[5.67]]

Where I have replaced chars and added new lines to make it more readable. I can "find" all the implicit keys with this expression:

import re

# find all the implicit keys - these must be double quoted per JSON spec
re_pattern = r'^[^:-][^:]*'
keys = re.findall(re_pattern, new_string, re.M)
print(keys)

Which returns:

['APPLES12', '10.000', '5.000', 'BANNAS34', '5.000', '4.800', 'GRAPES', '10.00', '5.00', '3.00', 'PEARS', '10.00', '5.000', '3.000']

Rather than just "finding" all the keys, I need to replace them with the quoted versions of themselves (and then add the curly braces) to make a JSON string that can be converted to a Python dict with json.loads().

I am open to other methods of replacement too. The newlines (\n) were added only for intermediate readability, to get the keys on separate lines.

Speed is important, as this will be applied to pandas data containing millions of strings like this.

CodePudding user response:

You're making this harder on yourself than it needs to be. Make a nested dictionary and the json module will do the rest.

Start with an empty dictionary:

mydict = {}

The top level objects are separated by underscores, with keys split on colons:

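# strip() first: test_string starts and ends with a newline,
# which would otherwise end up in the first key and the last value
for obj in test_string.strip().split('_'):
    key, content = obj.split(':', 1)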

Nested objects are separated by %, with keys split by ^ and list elements by V. So the nested loop will be

    nested = {}
    for item in content.split('%'):
        k, v = item.split('^')
        nested[k] = v.split('V')
    mydict[key] = nested

Now you can pass the result to json.dumps or any other serializer, which will handle all the quoting and formatting for you correctly.
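Put together, the steps above can be sketched as a single function. This is only a minimal sketch: the name parse_record and the shortened sample string are made up for illustration, and it assumes every record is well formed (each top-level object has a ':' and each nested item has exactly one '^').

```python
import json


def parse_record(raw: str) -> dict:
    """Parse one delimited record into a dict of dicts of lists (hypothetical helper)."""
    result = {}
    # Top-level objects are separated by '_', key and content by the first ':'.
    for obj in raw.strip().split('_'):
        key, content = obj.split(':', 1)
        nested = {}
        # Nested items are separated by '%', key and values by '^', values by 'V'.
        for item in content.split('%'):
            k, v = item.split('^')
            nested[k] = v.split('V')
        result[key] = nested
    return result


# Shortened sample in the same format as test_string.
sample = "GRAPES:10.00^3.125%5.00^4.345_PEARS:10.00^3.123V4.5"
parsed = parse_record(sample)
print(json.dumps(parsed))
# {"GRAPES": {"10.00": ["3.125"], "5.00": ["4.345"]}, "PEARS": {"10.00": ["3.123", "4.5"]}}
```

If you only need the dict, you can skip json.dumps entirely; the JSON round trip is only needed when the text itself has to be stored or sent somewhere.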

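If you want floats instead of strings in the nested data (keep in mind that JSON object keys are always strings, so json.dumps will write float keys back out as strings such as "10.0"):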

    for item in content.split('%'):
        k, v = item.split('^')
        nested[float(k)] = list(map(float, v.split('V')))
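If you do want the regex route from the question, one possible sketch is below. It quotes every token that is immediately followed by ':' or '^' using re.sub, where \g<0> refers to the whole match, and then swaps the delimiters in one pass with str.translate. The names KEY_PATTERN, DELIMITERS and to_json_text are made up for illustration. It assumes the input is non-empty and well formed, and that key names never contain any of the delimiter characters (a key such as OLIVES would break on the 'V'). Whether it is faster than the split-based version is something you would have to measure on your own data.

```python
import json
import re

# A key is a run of non-delimiter characters directly followed by ':' or '^'.
KEY_PATTERN = re.compile(r'[^:^%_V]+(?=[:^])')

# Single-pass delimiter replacement; nested objects use braces, value lists use brackets.
DELIMITERS = str.maketrans({
    ':': ':{',
    '^': ':[',
    'V': ',',
    '%': '],',
    '_': ']},',
})


def to_json_text(raw: str) -> str:
    """Turn one delimited record into JSON text (hypothetical helper)."""
    # Replace each key with the double-quoted version of itself.
    quoted = KEY_PATTERN.sub(r'"\g<0>"', raw.strip())
    # Swap the delimiters, then open the outer object and close the last list and objects.
    return '{' + quoted.translate(DELIMITERS) + ']}}'


sample = "GRAPES:10.00^3.125%5.00^4.345_PEARS:10.00^3.123V4.5"
json_text = to_json_text(sample)
print(json_text)
# {"GRAPES":{"10.00":[3.125],"5.00":[4.345]},"PEARS":{"10.00":[3.123,4.5]}}

data = json.loads(json_text)
```

Here the values are left unquoted, so json.loads reads them as floats while the keys stay strings.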