force re.search to include # and $

I am trying to get a substring between two markers using re in Python, for example:

import re
test_str = "#$ -N model_simulation 2022"

# these two lines work
# the output is: model_simulation
print(re.search("-N(.*)2022",test_str).group(1))
print(re.search(" -N(.*)2022",test_str).group(1))

# these two lines give the error: 'NoneType' object has no attribute 'group'
print(re.search("$ -N(.*)2022",test_str).group(1))
print(re.search("#$ -N(.*)2022",test_str).group(1))

I read the documentation of re here. It says that "#" is intentionally ignored so that the outputs look neater.

But in my case, I do need to include "#" and "$". I need them to identify the part of the string that I want, because the "-N" is not unique in my entire text string for real work.

Is there a way to force re to include those? Or is there a different way without using re?

Thanks.

CodePudding user response：

You can escape both with a backslash (and use a raw string so Python passes the backslashes through unchanged), for example:

print(re.search(r"\#\$ -N(.*)2022",test_str).group(1))
# output  model_simulation

CodePudding user response：

You can remove the special meaning of $ by prefixing it with a backslash: \$. This way, the pattern matches a literal dollar sign in the string.

# add backslash before # and $ 
# the output is: model_simulation
print(re.search(r"\$ -N(.*)2022",test_str).group(1))
print(re.search(r"\#\$ -N(.*)2022",test_str).group(1))

CodePudding user response：

In regular expressions, $ signals the end of the string. So 'foo' would match foo anywhere in the string, but 'foo$' only matches foo if it appears at the end. Your pattern "$ -N(.*)2022" therefore asks for more text after the end of the string, which can never match. To solve this, escape the $ by prefixing it with a backslash; \$ matches a literal $ character.
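A small sketch of the difference, reusing test_str from the question (the names unescaped and escaped are only for illustration):

```python
import re

test_str = "#$ -N model_simulation 2022"

# Unescaped: $ is an end-of-string anchor, so " -N" can never follow it
# and the search returns None.
unescaped = re.search(r"$ -N(.*)2022", test_str)

# Escaped: \$ matches a literal dollar sign.
escaped = re.search(r"\$ -N(.*)2022", test_str)

print(unescaped)                 # None
print(escaped.group(1).strip())  # model_simulation
```

Note that the captured group itself includes the spaces around the name, hence the strip().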

# is only the start of a comment in verbose mode using re.VERBOSE (which also ignores spaces), otherwise it just matches a literal #.
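To illustrate, here is a sketch of the same pattern written with and without re.VERBOSE, again using test_str from the question. In verbose mode the # and the space have to be escaped or put in a character class, otherwise they are dropped from the pattern:

```python
import re

test_str = "#$ -N model_simulation 2022"

# Default mode: # is an ordinary character and matches itself.
plain = re.search(r"#\$ -N(.*)2022", test_str)

# Verbose mode: unescaped whitespace is ignored and # starts a comment,
# so the literal # is escaped and the literal space goes in a character class.
verbose = re.search(
    r"""
    \#\$      # a literal hash followed by a literal dollar sign
    [ ]-N     # a space, then -N
    (.*)      # the part to capture
    2022
    """,
    test_str,
    re.VERBOSE,
)

print(plain.group(1) == verbose.group(1))  # True
```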

In general, it is also good practice to use raw string literals for regular expressions (r'foo'). In a raw string Python leaves backslashes alone, so its own escape processing doesn't conflict with the regular expression syntax (that way you don't have to type \\\\ to match a single backslash \).
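A quick sketch of what that means in practice. The path string here is just a made-up example, not something from the question:

```python
import re

path = "C:\\temp"  # the actual text is C:\temp, with one backslash

# Normal string: the regex needs \\ to match one backslash, and Python
# needs each of those doubled again, giving four backslashes in the source.
normal = re.search("\\\\", path)

# Raw string: the pattern is written exactly as the regex engine sees it.
raw = re.search(r"\\", path)

print(normal.group(0) == raw.group(0))  # True
```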

Instead of re.search, it looks like you actually want re.fullmatch, which matches only if the whole string matches.

So I would write your code like this:

print(re.search(r"\$ -N(.*)2022", test_str).group(1)) # This one would not work with fullmatch, because it doesn't match at the start
print(re.fullmatch(r"#\$ -N(.*)2022", test_str).group(1))

In a comment you mentioned that the string you need to match changes all the time. In that case, re.escape may prove useful.

Example:

prefix = '#$ -N'
postfix = '2022'
print(re.fullmatch(re.escape(prefix) + '(.*)' + re.escape(postfix), test_str).group(1))
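If the markers change often, the same idea could be wrapped in a small helper. This is only a sketch: between and between_plain are hypothetical names, and I am assuming you want re.search rather than re.fullmatch because the markers sit somewhere inside a larger text. The second function is one possible way to do it without re, using str.partition.

```python
import re

def between(text, prefix, postfix):
    """Return the text between two literal markers, or None if not found."""
    # re.escape makes every character in the markers match literally.
    pattern = re.escape(prefix) + r"(.*)" + re.escape(postfix)
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None

def between_plain(text, prefix, postfix):
    """Same idea without re: split on the first prefix, then the first postfix."""
    _, found_prefix, rest = text.partition(prefix)
    if not found_prefix:
        return None
    middle, found_postfix, _ = rest.partition(postfix)
    return middle.strip() if found_postfix else None

test_str = "#$ -N model_simulation 2022"
print(between(test_str, "#$ -N", "2022"))        # model_simulation
print(between_plain(test_str, "#$ -N", "2022"))  # model_simulation
print(between(test_str, "#$ -X", "2022"))        # None
```

One difference to keep in mind: (.*) is greedy, so the regex version stops at the last occurrence of the postfix on the line, while the partition version stops at the first one.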