Select substring based on first match, requiring an earlier 2nd pattern match too

Time:11-17

I'm trying to pull a specific session ID from some very messy log files. Even after looking at other examples, I'm having a hard time writing a regular expression that works. The logs come in looking like this, and I am trying to collect the sessionId that appears right after the first occurrence of an earlier string, "TestAPIurl":

b'{\n  "log": {\n    "version": "1.2",\n    "creator": {\n      "name": "mitmproxy har_dump",
\n      "version": "0.1",\n      "GET": "TestAPIurl"\n    },
\n    "entries": [\n      {\n        "startedDateTime": "2021-10-11T22:39:30.916773+00:00",
\n        "sessionId": "1521-35236erg-fggr4",\n        "request": {\n          "method": "POST",\n         

I'm struggling with the regex needed to ensure that I only get the sessionId that comes after the proper TestAPIurl. I can capture a sessionId value, but there are multiple sessionIds, and I'm having difficulty making sure I get the first one after the pattern "TestAPIurl", because the amount of space and the values between "TestAPIurl" and the next "sessionId" can vary a lot.

m = re.search('(?<=sessionId":\s.).*(?=")', file)
m.group()

CodePudding user response:

First off, that looks a lot like JSON. Is processing it as JSON an option? If so, I'd definitely try to make that work as I'd have complete control over knowing which session belongs to which request.
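If JSON parsing is an option, a minimal sketch could look like the following. It assumes the log is one complete, valid JSON document shaped like the sample in the question (a "GET" key under log.creator and a list under log.entries); the shortened sample data and the helper name first_session_id are made up for illustration.

```python
import json

# Hypothetical sample, shaped like the log in the question but shortened.
raw = b'''{
  "log": {
    "creator": {"name": "mitmproxy har_dump", "version": "0.1", "GET": "TestAPIurl"},
    "entries": [
      {"sessionId": "1521-35236erg-fggr4", "request": {"method": "POST"}},
      {"sessionId": "another-session", "request": {"method": "GET"}}
    ]
  }
}'''


def first_session_id(raw_bytes, api_url):
    """Return the first entry's sessionId if the log belongs to api_url, else None."""
    log = json.loads(raw_bytes)["log"]  # json.loads accepts bytes as well as str
    if log.get("creator", {}).get("GET") != api_url:
        return None
    for entry in log.get("entries", []):
        if "sessionId" in entry:
            return entry["sessionId"]
    return None


print(first_session_id(raw, "TestAPIurl"))
```

Because the lookup follows the document structure, it does not matter how much text sits between the URL and the session ID.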

But if it's not...

I stripped your sample data down, then doubled it up, making TestAPIurlFoo and TestAPIurlBar:

import re

input = '''
{\n  "log": {\n
\n     "version": "0.1",\n  "GET": "TestAPIurlFoo" },
\n   "entries": [\n  {\n
\n     "sessionId": "sessionIdFoo",\n  "request": {\n "method": "POST",\n ...
{\n  "log": {\n
\n      "version": "0.1",\n  "GET": "TestAPIurlBar" },
\n    "entries": [\n  {\n
\n      "sessionId": "sessionIdBar",\n  "request": {\n "method": "POST",\n ...
'''

api_url_names = ['TestAPIurlFoo','TestAPIurlBar', 'TestAPIurlBaz']

for api_url_name in api_url_names:
    pattern = f'"{api_url_name}".+?"sessionId": "(.+?)"'  # build the pattern for each API URL name

    match = re.search(pattern, input, re.DOTALL)  # DOTALL so that `.` matches newlines
    if match:
        print(match.group(1))

and I get the following:

sessionIdFoo
sessionIdBar

The search for TestAPIurlBaz returns nothing, which I think is what you want. re.DOTALL makes . (dot) match newline characters too, so the pattern can match across line breaks.
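To see what re.DOTALL changes, here is a small sketch on a made-up two-line string (the sample text and variable names are only for illustration):

```python
import re

# Two lines: the URL is on the first, the session ID on the second.
text = '"GET": "TestAPIurl"\n"sessionId": "abc"'
pattern = r'"TestAPIurl".+?"sessionId": "(.+?)"'

without_dotall = re.search(pattern, text)          # `.` cannot cross the newline
with_dotall = re.search(pattern, text, re.DOTALL)  # `.` also matches "\n"

print(without_dotall)        # None
print(with_dotall.group(1))  # abc
```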

I think of your requirement, in regex terms, like:

the pattern I want starts with this API-url and ends with the first session id

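Here's the pattern for Foo:

"TestAPIurlFoo".+?"sessionId": "(.+?)"

.+? is a lazy (non-greedy) match: it consumes as few characters as possible (but at least one) between the literals "TestAPIurlFoo" and "sessionId", so the match ends at the first sessionId after the URL.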
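The difference between a lazy and a greedy quantifier can be sketched on a made-up string that contains two session IDs after the URL (the sample text is an assumption, not taken from the real logs):

```python
import re

text = '"TestAPIurlFoo" ... "sessionId": "first" ... "sessionId": "second"'

# Lazy: stop at the first "sessionId" after the URL.
lazy = re.search(r'"TestAPIurlFoo".+?"sessionId": "(.+?)"', text, re.DOTALL)

# Greedy: run to the end, then backtrack to the last "sessionId".
greedy = re.search(r'"TestAPIurlFoo".+"sessionId": "(.+?)"', text, re.DOTALL)

print(lazy.group(1))    # first
print(greedy.group(1))  # second
```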

CodePudding user response:

Here is a regular expression that will return the value of the sessionId key.

>>> re.search(r'"sessionId"\s*:\s*"(.+?)"', file).group(1)
1521-35236erg-fggr4

Note that the string you provided was prefixed with b, so it looks like a bytes rather than a str - you may have to prefix the regular expression pattern with b as well, or pass something like file.decode("utf8") instead of just file to re.search().
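Both options can be sketched as follows, assuming the log really is a bytes object; the shortened data value is made up for illustration. Mixing a str pattern with bytes input raises a TypeError, so the pattern type has to match the input type.

```python
import re

# Hypothetical, shortened bytes input resembling the log in the question.
data = b'{\n "sessionId": "1521-35236erg-fggr4",\n "request": {}'

# Option 1: use a bytes pattern; the captured group is bytes too.
m = re.search(rb'"sessionId"\s*:\s*"(.+?)"', data)
session_bytes = m.group(1)

# Option 2: decode to str first, then use an ordinary str pattern.
m = re.search(r'"sessionId"\s*:\s*"(.+?)"', data.decode("utf8"))
session_str = m.group(1)

print(session_bytes)
print(session_str)
```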
