Python string formatting using regex

Time:10-01

I am new to regex. I have a string in this format:

IhaveThisString = {WebhookName:webhook,RequestBody:somebody,RequestHeader:{emailCallBackUrl:https://yyyy-xx.zzzzz.logic.azure.com/workflows/efdbb900c/runs/00268387CU20/actions/HTTP_Webhook/repetitions/000000/run?api-version=2050-02-04&sp=/runs/08sssssssssss0/actions/HTTP_Webhook/repetitions/000000/run,/runs/xxxxxxxxxxxxxxxxx20/actions/HTTP_Webhook/repetitions/000000/read&sv=1.0&sig=eYmmxxxxxxxxxxxxxxxxxxxhjIM,emailFileContent:hDQphc2xxxxxxxxxhpc2gsbWlzaHJhDQo=,emailFileName:456.csv,emailFrom:[email protected],emailSubject:Example,x-ms-workflow-id:6cxxxxxxxxxxxxxxbb900c,x-ms-workflow-version:08xxxxxxxxxxxxxxxx150,x-ms-workflow-name:runbook,x-ms-workflow-system-id:/locations/cxxxxx/scaleunits/xxxxx/workflows/6c14xxxxxxxxxxxxxxxb900c,x-ms-workflow-run-id:08xxxxxxxxxxxxx0,x-ms-workflow-run-tracking-id:c0889f0a-8ef9-5555-x111-77ldkfw98r34c54,x-ms-workflow-operation-name:HTTP_Webhook,x-ms-workflow-repeatitem-scope-name:For_each,x-ms-workflow-repeatitem-index:0,x-ms-workflow-repeatitem-batch-index:0,x-ms-execution-location:xxxxxxxx,x-ms-workflow-subscription-id:hfhfh-d6s6d-d7d9s-7ASassASasas4,x-ms-workflow-resourcegroup-name:rg_poc,x-ms-tracking-id:c3-5xxxx-xxxxx-asdsd-2xxxx21,x-ms-correlation-id:c3xxxxx-5xxxx-4xxxx-xxxxx-25xxxxx,x-ms-client-request-id:cxxxxxx-xxxx-4xxxx-axxx-2sssssz21,x-ms-client-tracking-id:08wwsswsCU16,x-ms-action-tracking-id:b1sdsd6-85sdsd-4fsdsd-sddc-988888888891,x-ms-zone-redundancy:optional,x-ms-activity-vector:AB.0L.OU.23,Connection:Keep-Alive,Accept-Encoding:gzip,Accept-Language:en,Host:xxxxxxxxxxx0418b.webhook.eus.azure-automation.net,User-Agent:azure-logic-apps/1.0}}"

I need to use json.loads to convert the above into a JSON object.

json_object = json.loads(IhaveThisString)

The problem is that the single quotes before and after each colon are missing from my string.

I need to reformat the string like this:

{'WebhookName':'webhook','RequestBody':'somebody','RequestHeader':{'emailCallBackUrl':'https://yyyy-xx.zzzzz.logic.azure.com/workflows/efdbb900c/runs/00268387CU20/actions/HTTP_Webhook/repetitions/000000/run?api-version=2050-02-04&sp=/runs/08sssssssssss0/actions/HTTP_Webhook/repetitions/000000/run,/runs/xxxxxxxxxxxxxxxxx20/actions/HTTP_Webhook/repetitions/000000/read&sv=1.0&sig=eYmmxxxxxxxxxxxxxxxxxxxhjIM','emailFileContent':'hDQphc2xxxxxxxxxhpc2gsbWlzaHJhDQo=','emailFileName':'456.csv','emailFrom':'[email protected]','emailSubject':'Example','x-ms-workflow-id':'6cxxxxxxxxxxxxxxbb900c','x-ms-workflow-version':'08xxxxxxxxxxxxxxxx150','x-ms-workflow-name':'runbook','x-ms-workflow-system-id':'/locations/cxxxxx/scaleunits/xxxxx/workflows/6c14xxxxxxxxxxxxxxxb900c','x-ms-workflow-run-id':'08xxxxxxxxxxxxx0','x-ms-workflow-run-tracking-id':'c0889f0a-8ef9-5555-x111-77ldkfw98r34c54','x-ms-workflow-operation-name':'HTTP_Webhook','x-ms-workflow-repeatitem-scope-name':'For_each','x-ms-workflow-repeatitem-index':'0','x-ms-workflow-repeatitem-batch-index':'0','x-ms-execution-location':'xxxxxxxx','x-ms-workflow-subscription-id':'hfhfh-d6s6d-d7d9s-7ASassASasas4','x-ms-workflow-resourcegroup-name':'rg_poc','x-ms-tracking-id':'c3-5xxxx-xxxxx-asdsd-2xxxx21','x-ms-correlation-id':'c3xxxxx-5xxxx-4xxxx-xxxxx-25xxxxx','x-ms-client-request-id':'cxxxxxx-xxxx-4xxxx-axxx-2sssssz21','x-ms-client-tracking-id':'08wwsswsCU16','x-ms-action-tracking-id':'b1sdsd6-85sdsd-4fsdsd-sddc-988888888891','x-ms-zone-redundancy':'optional','x-ms-activity-vector':'AB.0L.OU.23','Connection':'Keep-Alive','Accept-Encoding':'gzip','Accept-Language':'en','Host':'xxxxxxxxxxx0418b.webhook.eus.azure-automation.net','User-Agent':'azure-logic-apps/1.0'}}"

Please let me know how to use re in Python to achieve this.

Note: the data contains URLs such as emailCallBackUrl:https://yyyy-xx.zzzzz, which should be converted to 'emailCallBackUrl':'https://yyyy-xx.zzzzz'.

CodePudding user response:

This was a fun one to solve, thanks for the clear explanation and example input!

The basic approach is to find sequences of characters that are not delimiter symbols (:, {, }, ,):

[^{:},]+

However, this doesn't quite work, because the https: gets split off from the rest of the URL.

This can be resolved by allowing for an optional https: inside the capturing group:

((?:https:)?[^{:},]+)

Of course, you can add any other exceptions you need in the same way, e.g. ((?:https:|http:)?[^{:},]+) to also capture http:.
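To make the difference between the two patterns concrete, here is a small sketch using re.findall. The sample string is a shortened, made-up stand-in for the real input (example.invalid is a placeholder host), so treat it as an illustration rather than a test against the actual data:

```python
import re

# Shortened, hypothetical sample in the same shape as the question's string.
sample = (
    "{WebhookName:webhook,RequestHeader:"
    "{emailCallBackUrl:https://example.invalid/run?x=1,Host:example.invalid}}"
)

# Naive pattern: every run of non-delimiter characters is a token,
# so the colon after "https" cuts the URL in two.
naive = re.findall(r'[^{:},]+', sample)

# Allowing an optional "https:" prefix keeps the scheme attached to the URL.
with_scheme = re.findall(r'(?:https:)?[^{:},]+', sample)

print(naive)
print(with_scheme)
```

The first list contains 'https' and '//example.invalid/run?x=1' as separate tokens, while the second contains the whole URL as one token. Note that any other delimiter character inside a value (for example a literal comma inside a URL) would still split that value.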

Full Python code:

IhaveThisString = "{WebhookName:webhook,RequestBody:somebody,RequestHeader:{emailCallBackUrl:https://yyyy-xx.zzzzz.logic.azure.com/workflows/efdbb900c/runs/00268387CU20/actions/HTTP_Webhook/repetitions/000000/run?api-version=2050-02-04&sp=/runs/08sssssssssss0/actions/HTTP_Webhook/repetitions/000000/run,/runs/xxxxxxxxxxxxxxxxx20/actions/HTTP_Webhook/repetitions/000000/read&sv=1.0&sig=eYmmxxxxxxxxxxxxxxxxxxxhjIM,emailFileContent:hDQphc2xxxxxxxxxhpc2gsbWlzaHJhDQo=,emailFileName:456.csv,emailFrom:[email protected],emailSubject:Example,x-ms-workflow-id:6cxxxxxxxxxxxxxxbb900c,x-ms-workflow-version:08xxxxxxxxxxxxxxxx150,x-ms-workflow-name:runbook,x-ms-workflow-system-id:/locations/cxxxxx/scaleunits/xxxxx/workflows/6c14xxxxxxxxxxxxxxxb900c,x-ms-workflow-run-id:08xxxxxxxxxxxxx0,x-ms-workflow-run-tracking-id:c0889f0a-8ef9-5555-x111-77ldkfw98r34c54,x-ms-workflow-operation-name:HTTP_Webhook,x-ms-workflow-repeatitem-scope-name:For_each,x-ms-workflow-repeatitem-index:0,x-ms-workflow-repeatitem-batch-index:0,x-ms-execution-location:xxxxxxxx,x-ms-workflow-subscription-id:hfhfh-d6s6d-d7d9s-7ASassASasas4,x-ms-workflow-resourcegroup-name:rg_poc,x-ms-tracking-id:c3-5xxxx-xxxxx-asdsd-2xxxx21,x-ms-correlation-id:c3xxxxx-5xxxx-4xxxx-xxxxx-25xxxxx,x-ms-client-request-id:cxxxxxx-xxxx-4xxxx-axxx-2sssssz21,x-ms-client-tracking-id:08wwsswsCU16,x-ms-action-tracking-id:b1sdsd6-85sdsd-4fsdsd-sddc-988888888891,x-ms-zone-redundancy:optional,x-ms-activity-vector:AB.0L.OU.23,Connection:Keep-Alive,Accept-Encoding:gzip,Accept-Language:en,Host:xxxxxxxxxxx0418b.webhook.eus.azure-automation.net,User-Agent:azure-logic-apps/1.0}}"

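import re
import json
import pprint

# Wrap every run of non-delimiter characters in double quotes; the optional
# (?:https:) keeps the URL scheme attached to the rest of the URL.
with_quotes = re.sub(r'((?:https:)?[^{:},]+)', r'"\1"', IhaveThisString)
my_json = json.loads(with_quotes)
pp = pprint.PrettyPrinter(depth=4)
pp.pprint(my_json)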

Output:

{'RequestBody': 'somebody',
'RequestHeader': {'Accept-Encoding': 'gzip',
                'Accept-Language': 'en',
                'Connection': 'Keep-Alive',
                'Host': 'xxxxxxxxxxx0418b.webhook.eus.azure-automation.net',
                'User-Agent': 'azure-logic-apps/1.0',
                'emailCallBackUrl': 'https://yyyy-xx.zzzzz.logic.azure.com/workflows/efdbb900c/runs/00268387CU20/actions/HTTP_Webhook/repetitions/000000/run?api-version=2050-02-04&sp=/runs/08sssssssssss0/actions/HTTP_Webhook/repetitions/000000/run,/runs/xxxxxxxxxxxxxxxxx20/actions/HTTP_Webhook/repetitions/000000/read&sv=1.0&sig=eYmmxxxxxxxxxxxxxxxxxxxhjIM',
                'emailFileContent': 'hDQphc2xxxxxxxxxhpc2gsbWlzaHJhDQo=',
                'emailFileName': '456.csv',
                'emailFrom': '[email protected]',
                'emailSubject': 'Example',
                'x-ms-action-tracking-id': 'b1sdsd6-85sdsd-4fsdsd-sddc-988888888891',
                'x-ms-activity-vector': 'AB.0L.OU.23',
                'x-ms-client-request-id': 'cxxxxxx-xxxx-4xxxx-axxx-2sssssz21',
                'x-ms-client-tracking-id': '08wwsswsCU16',
                'x-ms-correlation-id': 'c3xxxxx-5xxxx-4xxxx-xxxxx-25xxxxx',
                'x-ms-execution-location': 'xxxxxxxx',
                'x-ms-tracking-id': 'c3-5xxxx-xxxxx-asdsd-2xxxx21',
                'x-ms-workflow-id': '6cxxxxxxxxxxxxxxbb900c',
                'x-ms-workflow-name': 'runbook',
                'x-ms-workflow-operation-name': 'HTTP_Webhook',
                'x-ms-workflow-repeatitem-batch-index': '0',
                'x-ms-workflow-repeatitem-index': '0',
                'x-ms-workflow-repeatitem-scope-name': 'For_each',
                'x-ms-workflow-resourcegroup-name': 'rg_poc',
                'x-ms-workflow-run-id': '08xxxxxxxxxxxxx0',
                'x-ms-workflow-run-tracking-id': 'c0889f0a-8ef9-5555-x111-77ldkfw98r34c54',
                'x-ms-workflow-subscription-id': 'hfhfh-d6s6d-d7d9s-7ASassASasas4',
                'x-ms-workflow-system-id': '/locations/cxxxxx/scaleunits/xxxxx/workflows/6c14xxxxxxxxxxxxxxxb900c',
                'x-ms-workflow-version': '08xxxxxxxxxxxxxxxx150',
                'x-ms-zone-redundancy': 'optional'},
'WebhookName': 'webhook'}

CodePudding user response:

This is going to be brittle and unreliable, because some of the values also contain : (or potentially some of the other delimiter characters); in the example, the URL https://... contains a colon that must not be treated as a JSON separator.

This will be even worse with fields like emailSubject, which can probably contain absolutely anything.

The example you gave can work with something like this:

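import re
import json
import pprint

# wrap each run of delimiter characters in double quotes
with_quotes = re.sub(r'[{}:,]+', r'"\g<0>"', IhaveThisString)

# fix up URLs broken by the quoting
with_quotes = re.sub('(https?)":"//', r'\g<1>://', with_quotes)

# the leading { and the trailing }} each gained a stray outer quote
assert with_quotes[0] == with_quotes[-1] == '"'

print(with_quotes)

# drop the stray outer quotes before parsing
json_object = json.loads(with_quotes[1:-1])

pprint.pprint(json_object)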

However, it's not a great solution, because punctuation in the emailSubject field will throw it off.
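As a rough illustration of that failure mode, the sketch below wraps the same two substitutions in a helper (quote_and_load is a name made up for this example) and feeds it two short, invented inputs: one with a plain subject and one whose subject contains a colon and a comma.

```python
import json
import re


def quote_and_load(raw):
    """Quote around delimiter runs, repair URL schemes, then parse as JSON."""
    quoted = re.sub(r'[{}:,]+', r'"\g<0>"', raw)
    quoted = re.sub(r'(https?)":"//', r'\g<1>://', quoted)
    # Drop the stray quotes added before the first { and after the last }.
    return json.loads(quoted[1:-1])


# Plain values parse fine.
ok = quote_and_load("{emailSubject:Example,emailFileName:456.csv}")
print(ok)

# A subject containing ':' and ',' is cut into several quoted pieces,
# so the result is no longer valid JSON.
failed = False
try:
    quote_and_load("{emailSubject:Re: hello, world,emailFileName:456.csv}")
except json.JSONDecodeError as exc:
    failed = True
    print("could not parse:", exc)
```

The second call raises json.JSONDecodeError because the subject ends up as "Re":" hello"," world", which json.loads cannot interpret as a single key/value pair.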

Ideally, try fixing the system sending you this data.

Otherwise, you may end up having to parse this by hand, without relying on json.loads, picking out the information you need from the data you have.
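If you do go down the hand-parsing route, one possible shape is sketched below. It rests on two assumptions that are mine, not something the data format guarantees: keys look like header names (letters, digits, - and _), and a value only ends at } or at a comma that is directly followed by such a key and a colon that does not start ://. The names parse_object, KEY and VALUE_END are made up for this sketch, and the sample string is invented.

```python
import re

# Assumption: keys are header-style names (letters, digits, '-' and '_').
KEY = re.compile(r'([A-Za-z][\w-]*):')
# Assumption: a value ends at '}' or at a comma followed by "key:",
# where that colon is not the start of a URL scheme ("://").
VALUE_END = re.compile(r'\}|,(?=[A-Za-z][\w-]*:(?!//))')


def parse_object(text, pos=0):
    """Parse '{key:value,...}' starting at text[pos]; return (dict, next_pos)."""
    if pos >= len(text) or text[pos] != '{':
        raise ValueError(f"expected '{{' at position {pos}")
    pos += 1
    result = {}
    while True:
        if pos >= len(text):
            raise ValueError("unexpected end of input")
        if text[pos] == '}':
            return result, pos + 1
        key_match = KEY.match(text, pos)
        if key_match is None:
            raise ValueError(f"expected a key at position {pos}")
        key, pos = key_match.group(1), key_match.end()
        if pos < len(text) and text[pos] == '{':
            # Nested object: recurse and continue after its closing brace.
            value, pos = parse_object(text, pos)
        else:
            end = VALUE_END.search(text, pos)
            if end is None:
                raise ValueError(f"unterminated value for key {key!r}")
            value, pos = text[pos:end.start()], end.start()
        result[key] = value
        if pos < len(text) and text[pos] == ',':
            pos += 1  # skip the separator before the next key


# Invented sample with a comma inside the URL and punctuation in the subject.
sample = (
    "{WebhookName:webhook,RequestHeader:{"
    "emailCallBackUrl:https://example.invalid/run,/runs/1/read&sig=abc,"
    "emailSubject:Re: hello, world,Host:example.invalid}}"
)
parsed, end_pos = parse_object(sample)
print(parsed)
```

This copes with commas and colons inside values as long as they are not followed by something that looks like a key, so a subject such as "a,b: c" would still be misread. It is a heuristic, not a real fix for the ambiguous format.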
