Most efficient way of checking if a string matches a pattern in Python?

I have a format string that someone else may change, for example:

sample = f"This is a {pet} it has {number} legs"

And I currently have two strings:

a = "This is a dog it has 4 legs"
b = "This was a dog"

How can I check which string satisfies this sample format? I could call Python's str.replace() on sample to build a regex and check it with re.match. The catch is that sample can change, so hard-coding the replace calls won't always work: sample may gain more placeholders.

CodePudding user response：

A simple way to extract the values is:

import re

patt = re.compile(r'This is a (.+) it has (\d+) legs')

a = "This is a dog it has 4 legs"
b = "This was a dog"
match = patt.search(a)
print(match.group(1), match.group(2))
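
Note that search returns None when the string does not match, so calling .group on the result for b would raise AttributeError. A minimal sketch of a guarded version follows; the helper name extract is hypothetical, and fullmatch is used on the assumption that the whole string should fit the pattern:

```python
import re

# Same hard-coded pattern as above: (.+) captures the pet, (\d+) the number.
patt = re.compile(r'This is a (.+) it has (\d+) legs')


def extract(text):
    """Return the captured groups if text fits the pattern, else None."""
    match = patt.fullmatch(text)
    return match.groups() if match else None


print(extract("This is a dog it has 4 legs"))  # ('dog', '4')
print(extract("This was a dog"))               # None
```

This still hard-codes the pattern, so it does not follow the template when it changes.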

CodePudding user response：

Try this.

sample = "This is a {pet} it has {number} legs"

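def check(string):
    patt = sample.split(' ')
    words = string.split(' ')
    # Positions of the placeholder words such as '{pet}'
    index = {i for i, v in enumerate(patt) if '{' in v and '}' in v}
    # The word counts must agree (this also avoids an IndexError on longer
    # strings), and every non-placeholder word must be identical.
    if len(words) == len(patt) and all(v == patt[i] or i in index for i, v in enumerate(words)):
        print('string matches the pattern')
    else:
        print('string does not match the pattern')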

a = "This is a dog it has 4 legs"
b = "This was a dog"
check(a) # string matches the pattern

CodePudding user response：

First, if you want to match against a template string, don't use the f'' string prefix, or else it will be evaluated immediately. Instead, just write the format string like:

sample = 'This is a {pet} it has {number} legs'

Here's a function I wrote for one project for parsing a format string and converting it to a regular expression:

import re
import string


def format_to_re(format_str, **kwargs):
    r"""
    Convert a format string to a regular expression, such that any format
    fields may be replaced with regular expression syntax, and any literals
    are properly escaped.

    As a special case, if a 2-tuple is given for the value of a field, the
    first time the field appears in the format string the first element of the
    tuple is used as the replacement, and the second element is used for all
    subsequent replacements.

    Examples
    --------

    This example uses a backslash just to add a little Windows flavor:

    >>> filename_format = \
    ...     r'scenario_{scenario}\{name}_{scenario}_{replicate}.npz'
    >>> filename_re = format_to_re(filename_format,
    ...     scenario=(r'(?P<scenario>0*\d+)', r'0*\d+'),
    ...     replicate=r'0*\d+', name=r'\w+')
    >>> filename_re
    'scenario_(?P<scenario>0*\\d+)\\\\\\w+_0*\\d+_0*\\d+\\.npz'
    >>> import re
    >>> filename_re = re.compile(filename_re)
    >>> filename_re
    re.compile(...)

    This regular expression can be used to match arbitrary filenames to
    determine whether or not they are in the format specified by the original
    ``filename_format`` template, as well as to extract the values of fields by
    using groups:

    >>> match = filename_re.match(r'scenario_000\my_model_000_000.npz')
    >>> match is not None
    True
    >>> match.group('scenario')
    '000'
    >>> filename_re.match(r'scenario_000\my_model_garbage.npz') is None
    True
    """

    formatter = string.Formatter()
    new_format = []
    seen_fields = set()

    for item in formatter.parse(format_str):
        literal, field_name, spec, converter = item
        new_format.append(re.escape(literal))

        if field_name is None:
            continue

        replacement = kwargs[field_name]

        if isinstance(replacement, tuple) and len(replacement) == 2:
            if field_name in seen_fields:
                replacement = replacement[1]
            else:
                replacement = replacement[0]

        new_format.append(replacement)
        seen_fields.add(field_name)

    return ''.join(new_format)

You can use this on your example like:

>>> sample_re = format_to_re(sample, pet=r'(?P<pet>.+)', number=r'(?P<number>\d+)')
>>> sample_re = re.compile(sample_re)
>>> sample_re
re.compile('This\\ is\\ a\\ (?P<pet>.+)\\ it\\ has\\ (?P<number>\\d+)\\ legs')
>>> m = sample_re.match('This is a dog it has 4 legs')
>>> m.groupdict()
{'pet': 'dog', 'number': '4'}

Depending on your use case you may be able to simplify it a bit. The original version was to handle some application-specific cases.

Another possible enhancement is, given an arbitrary format string, provide default regexps for each field found in it, possibly determined by any format specifiers in the field.
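
A rough sketch of that enhancement, under some assumptions: the function name template_to_regex, the DEFAULT_PATTERNS table, and the lazy fallback pattern are all hypothetical choices, not part of the original function. It also assumes field names are plain identifiers (no positional {} fields or attribute lookups), since they are reused as regex group names.

```python
import re
import string

# Hypothetical defaults keyed by the format-spec type character.
DEFAULT_PATTERNS = {
    'd': r'[+-]?\d+',
    'f': r'[+-]?\d+(?:\.\d+)?',
}
FALLBACK = r'.+?'  # assumption: any non-empty text, matched lazily


def template_to_regex(format_str):
    """Compile a format string into a regex with one named group per field."""
    parts = []
    seen = set()
    for literal, field_name, spec, _conversion in string.Formatter().parse(format_str):
        parts.append(re.escape(literal))
        if field_name is None:
            continue  # trailing literal text with no field after it
        if field_name in seen:
            # A repeated field must repeat the same text: use a backreference.
            parts.append(f'(?P={field_name})')
            continue
        type_char = spec[-1:] if spec else ''
        body = DEFAULT_PATTERNS.get(type_char, FALLBACK)
        parts.append(f'(?P<{field_name}>{body})')
        seen.add(field_name)
    return re.compile(''.join(parts))


rx = template_to_regex('This is a {pet} it has {number:d} legs')
print(rx.fullmatch('This is a dog it has 4 legs').groupdict())
# {'pet': 'dog', 'number': '4'}
print(rx.fullmatch('This is a dog it has four legs'))  # None
```

Because the regex is derived from the template itself, adding a placeholder to the template needs no other code change.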

CodePudding user response：

When you run:

sample = f"This is a {pet} it has {number} legs"

sample does not contain any placeholders.

It is the string "This is a xxx it has yyy legs", where xxx and yyy have already been replaced with the values of pet and number. So, unless you know what the parameters were, there is little you can do.

If you want placeholders, do not use an f-string:

sample = "This is a {pet} it has {number} legs"
formatted_string = sample.format(**{'pet': 'dog', 'number': '4'})
# "This is a dog it has 4 legs"

You can then run something like:

import re
import string
from operator import itemgetter

sample = "This is a {pet} it has {number} legs"

# Map every field name found in the template to a regex fragment
keys = {k: r'\w+' for k in filter(None,
        map(itemgetter(1), string.Formatter().parse(sample)))}
# {'pet': '\\w+', 'number': '\\w+'}

regex = re.compile(sample.format(**keys))


a = "This is a dog it has 4 legs"
b = "This was a dog"
regex.match(a)
# <re.Match object; span=(0, 27), match='This is a dog it has 4 legs'>

regex.match(b)
# None

CodePudding user response：

I liked these approaches, but I found a two-line solution (I don't know how it performs, but it works!):


import re

def pattern_match(input, pattern):
    regex = re.sub(r'{[^{]*}','(.*)', "^" + pattern + "$")
    if re.match(regex, input):
        print(f"'{input}' matches the pattern '{pattern}'")

pattern_match(a, sample)
pattern_match(b, sample)

Output

'This is a dog it has 4 legs' matches the pattern 'This is a {pet} it has {number} legs'
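
On the performance question, one possible variation is sketched below. It is an assumption-laden sketch rather than a measured improvement: the names PLACEHOLDER and compile_template are hypothetical, the compiled regex is cached with functools.lru_cache so a template checked many times is only converted once, and the literal text is passed through re.escape so characters such as ( or . in the template are matched literally. Escaped braces ({{ and }}) are not handled.

```python
import re
from functools import lru_cache

# Matches one placeholder such as {pet}; doubled braces are not handled.
PLACEHOLDER = re.compile(r'\{[^{}]*\}')


@lru_cache(maxsize=None)
def compile_template(pattern):
    """Build (once per distinct template) a regex with (.*) per placeholder."""
    literals = PLACEHOLDER.split(pattern)
    return re.compile('(.*)'.join(re.escape(part) for part in literals))


def pattern_match(text, pattern):
    """Return True if the whole of text fits the template."""
    return compile_template(pattern).fullmatch(text) is not None


sample = "This is a {pet} it has {number} legs"
print(pattern_match("This is a dog it has 4 legs", sample))  # True
print(pattern_match("This was a dog", sample))               # False
```

Returning a bool instead of printing keeps the check reusable, for example in a filter over many strings.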