python regex: match the dot only, not the letter before it

Time:12-08

I have a regex pattern as follows:

r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)'

I am trying to modify it so that it matches only the dot at the end of a sentence, not the letter before it. Here is my string:

sent = 'This is the U.A. we have r.a.d. golden 13.56 date. a better date 34. was there.'

Here is what I have tried:

import re
re.split(r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)', sent)

However, the split also removes the last letter of the word before each dot:

Current output:
['This is the U.A. we have r.a.d. golden 13.56 dat',' a better date 34. was ther',
 '']

My desired output is:

['This is the U.A. we have r.a.d. golden 13.56 date',' a better date 34. was there',
 '']

I do not know how to modify the pattern so that it keeps the last letter of the words 'date' and 'there'.

Answer:

Your pattern can be simplified and fixed as follows:

(?<=(?<![.\s])[a-zA-Z])\.

If you also need to match a run of several dots, add a + quantifier after the \. (that is, use \.+ at the end of the pattern).
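To illustrate the multiple-dot case, here is a small sketch comparing the pattern with and without a + after the \. part. The sample string and the names text, single and multi are made up for illustration and do not come from the question.

```python
import re

# Hypothetical sample text (not from the question) containing an ellipsis.
text = 'It ended... then restarted. done.'

# One dot per match: only the first dot of the ellipsis qualifies,
# because the other two are preceded by a dot, not a letter.
single = r'(?<=(?<![.\s])[a-zA-Z])\.'

# With +, a whole run of dots after the letter is one separator.
multi = r'(?<=(?<![.\s])[a-zA-Z])\.+'

print(re.split(single, text))  # ['It ended', '.. then restarted', ' done', '']
print(re.split(multi, text))   # ['It ended', ' then restarted', ' done', '']
```

Whether a trailing ellipsis should count as a sentence end depends on your data, so treat the + as optional.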

Details:

  • (?<=(?<![.\s])[a-zA-Z]) - a positive lookbehind that matches a location that is immediately preceded by:
    • (?<![.\s]) - a negative lookbehind that fails the match if there is a . or whitespace immediately to the left of the current location
    • [a-zA-Z] - an ASCII letter
  • \. - a literal dot.
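As a quick check, the pattern described above can be run against the string from the question. This is only a minimal sketch; the names pattern and parts are illustrative. It relies on Python's re module accepting a lookbehind nested inside another lookbehind, which works here because the outer lookbehind has a fixed width of one character.

```python
import re

sent = 'This is the U.A. we have r.a.d. golden 13.56 date. a better date 34. was there.'

# Match a dot only when it follows a letter that is itself
# not preceded by a dot or by whitespace.
pattern = r'(?<=(?<![.\s])[a-zA-Z])\.'

parts = re.split(pattern, sent)
print(parts)
# ['This is the U.A. we have r.a.d. golden 13.56 date', ' a better date 34. was there', '']
```

The trailing empty string appears because the last match is at the very end of the input; filter it out with a list comprehension if you do not need it.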

Your pattern is an alternation of two branches, (?<!\.|\s)[a-z]\. and (?<!\.|\s)[A-Z]\., which differ only in [a-z] versus [A-Z], so (?<!\.|\s)[a-zA-Z]\. is an equivalent, shorter form. (The lookbehind (?<!\.|\s) can likewise be written as (?<![.\s]).) Because re.split removes everything the pattern matches, the [a-zA-Z] part must be moved into a non-consuming construct so that the letter is not eaten up when splitting; a positive lookbehind is the natural way to do that.
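The difference between consuming the letter and only asserting it is easy to see with re.findall. The sketch below uses illustrative variable names (original, consuming, non_consuming) and the string from the question.

```python
import re

sent = 'This is the U.A. we have r.a.d. golden 13.56 date. a better date 34. was there.'

# The alternation from the question.
original = r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)'
# Shortened form: the letter is still part of the match.
consuming = r'(?<![.\s])[a-zA-Z]\.'
# Fixed form: the letter is only checked by a lookbehind.
non_consuming = r'(?<=(?<![.\s])[a-zA-Z])\.'

print(re.findall(original, sent))       # ['e.', 'e.']
print(re.findall(consuming, sent))      # ['e.', 'e.']
print(re.findall(non_consuming, sent))  # ['.', '.']
```

Since re.split discards whatever is matched, the first two patterns drop the 'e' of 'date' and 'there', while the third drops only the dot.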
