Parsing based on pattern not at the beginning

Time:06-25

I want to extract the number before "2022" in a set of strings. I currently do

a = mystring.strip().split("2022")[0]

and, for instance, when mystring=' 1020220519AX', this gives a = '10'. However,

mystring.strip().split("2022")[0]

fails to return a = '202' when mystring=' 20220220519AX' (it returns an empty string instead). Therefore, I want the code to split the string on a "2022" that is not at the beginning of the non-whitespace characters in the string. Can you please guide me with this?

CodePudding user response:

Use a regular expression rather than split().

import re

mystring = '   20220220519AX'
match = re.search(r'^\s*(\d+?)2022', mystring)
if match:
    print(match.group(1))

^\s* skips over the whitespace at the beginning, then (\d+?) captures the following digits up to the first 2022 that has at least one digit before it.
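As a minimal sketch, the pattern can be wrapped in a helper and checked against both strings from the question. The name number_before_2022 is hypothetical and only used for illustration.

```python
import re

def number_before_2022(text):
    """Return the digits after any leading whitespace and before the first
    later '2022', or None when the string does not have that shape."""
    # \d+? is lazy and needs at least one digit, so a '2022' sitting at the
    # very start of the digits cannot be the one that ends the capture.
    match = re.search(r'^\s*(\d+?)2022', text)
    return match.group(1) if match else None

print(number_before_2022(' 1020220519AX'))   # 10
print(number_before_2022(' 20220220519AX'))  # 202
print(number_before_2022(' AX0519'))         # None
```

Returning None for strings without a match keeps the "no number found" case distinct from an empty string.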

CodePudding user response:

You can tell a regex engine that you want all the digits before 2022:

r'\d+(?=2022)'

Like .split(), a regex engine is 'greedy' by default - 'greedy' here means that as soon as it can take something that it is instructed to take, it will take that and it won't try another option, unless the rest of the expression cannot be made to work.

So, in your case, mystring.strip().split("2022") splits on the first 2022 it can find, and since there's nothing stopping it, that is the result you have to work with.

Using regex, you can even tell it you're not interested in the 2022, but in the numbers before it: the \d+ will match as long a string of digits as it can find (greedy), but the (?=2022) part says it must be followed by a literal 2022 to be a match (and that won't be part of the match; this is a 'positive lookahead').

Using something like:

import re

mystring = '   20220220519AX'
print(re.findall(r'\d+(?=2022)', mystring))

will show you all non-overlapping matches.

Note that for a string like ' 920220220519AX 12022', it will find ['9202', '1'] and only that - it won't find all possible combinations of matches. The first, greedy pass through the string that succeeds is the answer you get.
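A short sketch that runs the lookahead pattern over the strings mentioned in this thread may make the greedy behaviour easier to see. The helper name digits_before_2022 is made up for this example.

```python
import re

# Greedy run of digits that must be followed by a literal 2022;
# the lookahead itself is not part of the match.
PATTERN = re.compile(r'\d+(?=2022)')

def digits_before_2022(text):
    """Return every non-overlapping match of PATTERN in text."""
    return PATTERN.findall(text)

for sample in (' 1020220519AX', ' 20220220519AX', ' 920220220519AX 12022'):
    print(repr(sample), digits_before_2022(sample))
# ' 1020220519AX' ['10']
# ' 20220220519AX' ['202']
# ' 920220220519AX 12022' ['9202', '1']
```

In the last string the engine backtracks from the full digit run to the longest prefix that is still followed by 2022, which is why it reports '9202' rather than '9'.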

CodePudding user response:

You could use re.split() after strip(), with a lookbehind asserting that the 2022 is not at the start of the string. Alternatively, in case there are more occurrences of 2022, you can match the first run of 1 or more digits at the start of the string that is followed by 2022:

import re

strings = [
    '   1020220519AX',
    '   20220220519AX'
]

for s in strings:
    parts = re.split(r"(?<!^)2022", s.strip())
    if parts:
        print(parts[0])

for s in strings:
    m = re.match(r"\s*(\d+?)2022", s)
    if m:
        print(m.group(1))

Both will output

10
202

Note that the split variant does not guarantee that the first part consists of digits; it only splits the string.

If the string consists of only word characters, splitting on \B2022, where \B means not a word boundary, will also prevent splitting at the start of the example string.
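A minimal sketch of the \B variant, assuming the input is stripped first. The helper name split_on_inner_2022 is hypothetical, and the third sample is an invented string added only to show the caveat about non-word characters.

```python
import re

def split_on_inner_2022(text):
    """Split on every '2022' that is not preceded by a word boundary."""
    # After strip(), a '2022' at the very start sits on a word boundary,
    # so \B rejects it and the split happens at a later occurrence.
    return re.split(r'\B2022', text.strip())

print(split_on_inner_2022('   1020220519AX'))   # ['10', '0519AX']
print(split_on_inner_2022('   20220220519AX'))  # ['202', '0519AX']
# Caveat: a '2022' right after a non-word character is also on a boundary,
# so it is not split even though it is not at the start of the string.
print(split_on_inner_2022('AB-2022X'))          # ['AB-2022X']
```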
