I have a regex pattern as follows:
r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)+'
I am trying to modify it so that it matches only the dot at the end of a sentence and not the letter before it. Here is my string:
sent = 'This is the U.A. we have r.a.d. golden 13.56 date. a better date 34. was there.'
And here is what I have done:
import re
re.split(r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)+', sent)
However, this removes the last letter of the word before each dot:
Current output:
['This is the U.A. we have r.a.d. golden 13.56 dat', ' a better date 34. was ther', '']
My desired output is:
['This is the U.A. we have r.a.d. golden 13.56 date', ' a better date 34. was there', '']
I do not know how to modify the pattern so that it keeps the last letter of the words 'date' and 'there'.
CodePudding user response:
Your pattern can be reduced to and fixed as
(?<=(?<![.\s])[a-zA-Z])\.
See the regex demo.
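As a quick check, here is a minimal sketch that applies the pattern above to the string from the question with re.split. The variable name parts is only illustrative.

```python
import re

# The sample string from the question.
sent = 'This is the U.A. we have r.a.d. golden 13.56 date. a better date 34. was there.'

# Split on a dot only when it follows an ASCII letter that is itself
# not preceded by a dot or whitespace. The letter is checked, not consumed.
parts = re.split(r'(?<=(?<![.\s])[a-zA-Z])\.', sent)

print(parts)
# ['This is the U.A. we have r.a.d. golden 13.56 date', ' a better date 34. was there', '']
```

The trailing empty string appears because the last dot is the final character of the input.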
If you also need to match multiple consecutive dots, put the + quantifier back after the \. (that is, use \.+).
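To illustrate the multiple-dots case, here is a small sketch that assumes the quantifier to add after \. is +. The sample text is made up for the comparison and is not from the question.

```python
import re

# Hypothetical sample text containing a run of dots (not from the question).
text = 'the end... a new start.'

# Without a quantifier, only the first dot of the run is matched,
# so the remaining dots stay in the next chunk.
single = re.split(r'(?<=(?<![.\s])[a-zA-Z])\.', text)

# With +, the whole run of dots is treated as one separator.
multi = re.split(r'(?<=(?<![.\s])[a-zA-Z])\.+', text)

print(single)  # ['the end', '.. a new start', '']
print(multi)   # ['the end', ' a new start', '']
```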
Details:
(?<=(?<![.\s])[a-zA-Z]) - a positive lookbehind that matches a location that is immediately preceded with:
  (?<![.\s]) - a negative lookbehind that fails the match if there is a . or whitespace immediately to the left of the current location
  [a-zA-Z] - an ASCII letter
\. - a literal dot.
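The same breakdown can be written into the pattern itself with re.VERBOSE. This is just a sketch: the name sentence_dot and the <DOT> marker are illustrative choices, used only to make the matched dots visible.

```python
import re

# The pattern from the answer, spelled out with re.VERBOSE so that
# each part can carry its own comment.
sentence_dot = re.compile(r'''
    (?<=              # positive lookbehind: the dot must be preceded by
        (?<![.\s])    #   a position with no dot or whitespace right before it,
        [a-zA-Z]      #   then exactly one ASCII letter
    )
    \.                # the literal dot, the only character actually matched
''', re.VERBOSE)

sent = 'This is the U.A. we have r.a.d. golden 13.56 date. a better date 34. was there.'

# Replace every matched dot with a visible marker.
marked = sentence_dot.sub('<DOT>', sent)
print(marked)
# This is the U.A. we have r.a.d. golden 13.56 date<DOT> a better date 34. was there<DOT>
```

Only the dots after 'date' and 'there' are replaced. The dots in 'U.A.', 'r.a.d.', '13.56' and '34.' are left alone.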
Your pattern is basically an alternation of two patterns, (?<!\.|\s)[a-z]\. and (?<!\.|\s)[A-Z]\., which differ only in [a-z] versus [A-Z]. The alternation can therefore be shortened to (?<!\.|\s)[a-zA-Z]\.
The [a-zA-Z] must be put into a non-consuming pattern so that the letter is not eaten up when splitting, so a positive lookbehind is a natural solution.
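A short sketch of that reasoning, using the string from the question: the two-branch alternation and its shortened form split identically and both swallow the letter, while the lookbehind version keeps it. The variable names are only illustrative.

```python
import re

sent = 'This is the U.A. we have r.a.d. golden 13.56 date. a better date 34. was there.'

# The two-branch alternation and its shortened equivalent.
alternation = r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)'
shortened = r'(?<!\.|\s)[a-zA-Z]\.'

# Both consume the letter before the dot, so re.split() drops it.
consuming = re.split(shortened, sent)
same_as_alternation = re.split(alternation, sent) == consuming

# With the letter inside a lookbehind it is only checked, not matched,
# so it stays in the result.
non_consuming = re.split(r'(?<=(?<![.\s])[a-zA-Z])\.', sent)

print(same_as_alternation)  # True
print(consuming)            # [..., 13.56 dat', ' a better date 34. was ther', '']
print(non_consuming)        # [..., 13.56 date', ' a better date 34. was there', '']
```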