Use of \r (carriage return) in python regex

Time:11-03

I'm trying to use regex to match every character between a string and a \r character :

text = 'Some text\rText to find !\r other text\r'

I want to match 'Text to find !'. I already tried :

re.search(r'Some text\r(.*)\r', text).group(1)

But it gives me : 'Text to find !\r other text'

It's surprising because it works perfectly when replacing \r by \n :

re.search(r'Some text\n(.*)\n', 'Some text\nText to find !\n other text\n').group(1)

returns Text to find !

Do you know why it behaves differently when we use \r and \n ?

CodePudding user response:

That is correct and expected behavior: by default, . in Python re excludes only LF (line feed, \n) chars, so it does match CR (carriage return, \r) chars.

See the re documentation:

.
           (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

You can easily check that with the following code:

import re
unicode_lbr = '\n\v\f\r\u0085\u2028\u2029'
print(re.findall(r'.+', f'abc{unicode_lbr}def'))
# => ['abc', '\x0b\x0c\r\x85\u2028\u2029def']

To match between two carriage return chars you need to use the negated character class:

r'Some text\r([^\r]*)\r'
r'Some text\r([^\r]*)'   # if the trailing CR char does not have to exist
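
As a minimal sketch, here are both patterns applied to the sample string from the question (the variable names are only illustrative):

```python
import re

text = 'Some text\rText to find !\r other text\r'

# [^\r]* cannot cross a carriage return, so the capture stops at the first CR
with_trailing_cr = re.search(r'Some text\r([^\r]*)\r', text).group(1)
without_trailing_cr = re.search(r'Some text\r([^\r]*)', text).group(1)

print(repr(with_trailing_cr))     # 'Text to find !'
print(repr(without_trailing_cr))  # 'Text to find !'
```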

In case you want to match between the leftmost and rightmost occurrences of \r chars (the outer CR chars) including any chars in between you can use a mere .* with re.DOTALL:

re.search(r'(?s)Some text\r(.*)\r', text)
re.search(r'Some text\r(.*)\r', text, re.DOTALL)

where (?s) is an inline modifier equal to re.DOTALL / re.S.
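
The flag only makes a difference when an LF char sits between the CR chars. A small sketch, assuming a made-up sample string that contains one:

```python
import re

# Hypothetical sample: an LF sits between the outer CR chars
sample = 'Some text\rline one\nline two\r tail\r'
pattern = r'Some text\r(.*)\r'

no_flag = re.search(pattern, sample)             # . stops at \n, so nothing matches
inline = re.search(r'(?s)' + pattern, sample)    # inline modifier
flagged = re.search(pattern, sample, re.DOTALL)  # flag argument

print(no_flag)                 # None
print(repr(inline.group(1)))   # 'line one\nline two\r tail'
print(repr(flagged.group(1)))  # 'line one\nline two\r tail'
```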

CodePudding user response:

.* is greedy, so it matches the longest possible string in:

r'Some text\r(.*)\r'

Hence giving you:

re.findall(r'Some text\r(.*)\r', 'Some text\rText to find !\r other text\r')
['Text to find !\r other text']

However, if you change it to a non-greedy quantifier, it gives the expected result:

re.findall(r'Some text\r(.*?)\r', 'Some text\rText to find !\r other text\r')
['Text to find !']

The reason re.findall(r'Some text\n(.*)\n', 'Some text\nText to find !\n other text\n') gives just ['Text to find !'] is that the dot matches any character except a newline (\n), so .* cannot run past the first \n. If you let the dot match newlines as well, either with [\s\S] (which matches any character) or by enabling DOTALL, it again finds the longest match:

>>> re.findall(r'Some text\n([\s\S]*)\n', 'Some text\nText to find !\n other text\n')
['Text to find !\n other text']

>>> re.findall(r'(?s)Some text\n(.*)\n', 'Some text\nText to find !\n other text\n')
['Text to find !\n other text']

This again changes when you use a non-greedy quantifier:

re.findall(r'(?s)Some text\n(.*?)\n', 'Some text\nText to find !\n other text\n')
['Text to find !']
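
One caveat, sketched below with a made-up variant of the question's string: when the text to find itself contains a \n, the non-greedy dot needs DOTALL to get across it, whereas a negated class such as [^\r]* does not.

```python
import re

# Hypothetical variant of the question's string with an LF inside the target text
text = 'Some text\rText to\nfind !\r other text\r'

lazy_plain = re.search(r'Some text\r(.*?)\r', text)       # . cannot cross \n
lazy_dotall = re.search(r'(?s)Some text\r(.*?)\r', text)  # DOTALL lets . cross \n
negated = re.search(r'Some text\r([^\r]*)\r', text)       # [^\r] matches \n already

print(lazy_plain)                  # None
print(repr(lazy_dotall.group(1)))  # 'Text to\nfind !'
print(repr(negated.group(1)))      # 'Text to\nfind !'
```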