Extract entities from selenium performance log using regex in python

I am using selenium to collect a performance log. The log is raw text, and I want to extract a few entities and their values from it.

I am expecting the final output like this:

{"cctv-c1-Id1id" : 1234, "cctv-key1id" : 2345,  "cctv-revise1id" : 5678, "cctv-path-Id":89}

What I have tried:

import re

regex = r"\"cctv-c1-Id1id\":\".+?\"|\"cctv-key1id\":\".+?\"|\"cctv-revise1id\":\".+?\"|\"cctv-path-Id\":\".+?\""

test_str = """{"columnNumber":1275999,"functionName":"","lineNumber":0,"scriptId":"27","url":"https://qsbr.fs.cctvcdn_example.net/-4-ans_frontend-relay-common-27-02324aa836affcweewa40.webpack"}]},"type":"script"},"loaderId":"E2FF94DD7C5E414sfsfggdf6F36DE","rInfo":false,"request":{"hascctvData":true,"headers":{"Content-Type":"application/json","cctv-c1-Id1id":"main-w-chan49-8121-react_skgfksghksgs-4VTx","cctv-Canary-Revision":"false","cctv-key1id":"237459235723jjsefsdhf34","cctv-revise1id":"sgfsgkdhfgkj546456","cctv-path-Id":"react_35346363546546","Referer":"https://www.cctv.com/","User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"""

matches       = re.finditer(regex, test_str, re.MULTILINE)
matches_group = {}
for matchNum, match in enumerate(matches, start=1):
    data_matched = match.group().split(":")
    matches_group[data_matched[0].strip('""')] = data_matched[1].strip('""')

Is there a better way to do the same task?

CodePudding user response：

I would shorten your regex to:

\"(cctv-c1-Id1id|cctv-key1id|cctv-revise1id|cctv-path-Id)\":\"(.+?)\"

This captures only the keys cctv-c1-Id1id, cctv-key1id, cctv-revise1id and cctv-path-Id, together with their values. With re.findall we get a list of key-value tuples, which is easy to convert to a dict.

Code

pattern = r"\"(cctv-c1-Id1id|cctv-key1id|cctv-revise1id|cctv-path-Id)\":\"(.+?)\""
matches_group = dict(re.findall(pattern, test_str))
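
For a self-contained check, here is a minimal sketch of the same idea on a shortened sample. The `sample` string and the `wanted_keys` list are assumptions made for illustration, not real log data; `re.escape` is used so that any special characters in a key are treated literally.

```python
import re

# Shortened stand-in for the performance log (illustrative, not real data)
sample = (
    '"headers":{"Content-Type":"application/json",'
    '"cctv-c1-Id1id":"main-w-chan49","cctv-Canary-Revision":"false",'
    '"cctv-key1id":"2374abc","cctv-revise1id":"sgf546",'
    '"cctv-path-Id":"react_353","Referer":"https://www.cctv.com/"}'
)

# Keys to extract; the alternation in the pattern is built from this list
wanted_keys = ["cctv-c1-Id1id", "cctv-key1id", "cctv-revise1id", "cctv-path-Id"]
key_alternation = "|".join(re.escape(key) for key in wanted_keys)

# Group 1 captures the key, group 2 lazily captures the value up to the next quote
pattern = rf'"({key_alternation})":"(.+?)"'

# findall returns (key, value) tuples, which dict() accepts directly
extracted = dict(re.findall(pattern, sample))
print(extracted)
```

Keeping the keys in a list means a new key can be added without editing the pattern by hand.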

CodePudding user response：

Let's first break the problem down to finding a proper pattern. The example is not ideal, as the expected output suggests numeric values while the log contains character strings, but your code example makes the intent clear. So let's go step by step:

We look for certain keys wrapped in double quotes: r'\"cctv\-((c1\-Id1id)|(key1id)|(revise1id)|(path\-Id))\"'. We can shorten your pattern by noting that the keys always start with 'cctv-'. The dash is escaped here, which is harmless but not strictly required: regex treats - as a range operator only inside a character class such as [a-z].
The key is followed by a colon, but there may be whitespace around it, so we better make that optional: '\s?:\s?'
The value is a run of characters wrapped in double quotes again: \"\w+\" (the + lets regex look for one or more characters). Note that \w covers only letters, digits and underscores. If you would like to look only for numbers, use \d.
Finally, it ends with a comma , (you may want to make this optional if the very last log entry is not closed by a comma).

import re

test_str = """{"columnNumber":1275999,"functionName":"","lineNumber":0,"scriptId":"27","url":"https://qsbr.fs.cctvcdn_example.net/-4-ans_frontend-relay-common-27-02324aa836affcweewa40.webpack"}]},"type":"script"},"loaderId":"E2FF94DD7C5E414sfsfggdf6F36DE","rInfo":false,"request":{"hascctvData":true,"headers":{"Content-Type":"application/json","cctv-c1-Id1id":"main-w-chan49-8121-react_skgfksghksgs-4VTx","cctv-Canary-Revision":"false","cctv-key1id":"237459235723jjsefsdhf34","cctv-revise1id":"sgfsgkdhfgkj546456","cctv-path-Id":"react_35346363546546","Referer":"https://www.cctv.com/","User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"""


pat = r'\"cctv\-((c1\-Id1id)|(key1id)|(revise1id)|(path\-Id))\"\s?:\s?\"\w+\"'
for match in re.finditer(pat, test_str, re.MULTILINE):
    print(match.group())

"cctv-key1id":"237459235723jjsefsdhf34"
"cctv-revise1id":"sgfsgkdhfgkj546456"
"cctv-path-Id":"react_35346363546546"

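Note that cctv-c1-Id1id is missing from this output: its value contains hyphens, and \w matches only letters, digits and underscores. If such values should be included, one option (an assumption about what you want, not part of the pattern above) is to accept any run of characters other than a double quote, `[^"]+`. The short `sample` string below is illustrative.

```python
import re

# Illustrative fragment: one value with hyphens, one without
sample = '"cctv-c1-Id1id":"main-w-chan49-8121","cctv-key1id":"237459abc",'

keys = r'cctv-(?:c1-Id1id|key1id|revise1id|path-Id)'

# \w+ stops at the first hyphen, so the hyphenated value cannot match
word_only = re.compile(r'"(' + keys + r')"\s?:\s?"(\w+)"')
# [^"]+ accepts everything up to the closing double quote
any_unquoted = re.compile(r'"(' + keys + r')"\s?:\s?"([^"]+)"')

word_only_result = dict(word_only.findall(sample))
any_unquoted_result = dict(any_unquoted.findall(sample))

print(word_only_result)     # only cctv-key1id
print(any_unquoted_result)  # both keys
```
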
You want to build a dictionary from the key-value pairs, so it might be better to split the pattern into parts. We first look for the whole pattern and then for the parts within the matched string:

# patterns
pat_key = r'cctv\-((c1\-Id1id)|(key1id)|(revise1id)|(path\-Id))'
pat_val = r'\w+'
pat = r'\"' + pat_key + r'\"\s?:\s?\"' + pat_val + r'\"'

matches = {}
for match in re.finditer(pat, test_str, re.MULTILINE):
    # extract key
    key = re.search(pat_key, match.group()).group()
    # extract value: the quoted text after the colon (searching for
    # pat_val alone would find "cctv" inside the key first)
    val = re.search(r':\s?\"(' + pat_val + r')\"', match.group()).group(1)
    # assign to dictionary
    matches[key] = val

{'cctv-key1id': '237459235723jjsefsdhf34',
'cctv-revise1id': 'sgfsgkdhfgkj546456',
'cctv-path-Id': 'react_35346363546546'}
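
The key and the value can also be read from a single match with named groups, so that no second search is needed. This is only a sketch under assumptions: the group names `key` and `val`, the `sample` string, and the use of `re.VERBOSE` to comment each step are illustrative choices.

```python
import re

# Illustrative fragment with optional whitespace around the colon
sample = '"cctv-key1id" : "2374abc","cctv-path-Id":"react_353","Referer":"x"'

# re.VERBOSE ignores whitespace in the pattern and allows comments
pat = re.compile(r'''
    "(?P<key>cctv-(?:c1-Id1id|key1id|revise1id|path-Id))"   # quoted key
    \s?:\s?                                                 # colon, optional whitespace
    "(?P<val>\w+)"                                          # quoted value
    ,?                                                      # optional trailing comma
''', re.VERBOSE)

# Each match carries both parts, so the dictionary is built in one pass
matches = {m.group('key'): m.group('val') for m in pat.finditer(sample)}
print(matches)
```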