I want to replace every dot with \n using Python, except dots that sit between digits or that are followed by specific text (such as com).
Input: I have a meeting at 8.30. I will be at meet.com. Bye.
Output: I have a meeting at 8.30 \n I will be at meet.com \n Bye \n
Here is my attempt:
import re
def replace_dot_for_original_sentence(text):
    dot = "."
    for char in text:
        if char in dot:
            if not re.match(r'(?<=\d)\.(?=\d)', text) or not re.match(r'\.(com|org|net|co|id)', text):
                text = text.replace(char, "\n")
    return text
This replaces every dot with \n. I have also tried re.search. I think the problem is in the if conditions. Any ideas?
CodePudding user response:
We can try using re.sub here:
import re
inp = "I have a meeting at 8.30. I will be at meet.com. Bye."
output = re.sub(r'\.(?:\s+|$)', ' \n ', inp)
print(output) # I have a meeting at 8.30 \n I will be at meet.com \n Bye \n (each \n is printed as a real line break)
CodePudding user response:
Try this pattern:
import re
def replace_dot_for_original_sentence(text):
    # Replace a dot followed by whitespace, or a dot at the very end of the string
    text = re.sub(r'\.\s+|\.$', '\n', text)
    return text
print(replace_dot_for_original_sentence('I have a meeting at 8.30. I will be at meet.com. Bye.'))
Output
I have a meeting at 8.30
I will be at meet.com
Bye
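To make the inserted line breaks visible, here is a small sketch that prints the result with repr. It assumes the pattern is meant to be \.\s+|\.$, and the helper name replace_sentence_dots is only illustrative. It also shows that this approach relies on whitespace: dots that are not followed by whitespace (and are not at the end of the string) are left alone.

```python
import re

def replace_sentence_dots(text):
    # Hypothetical helper name. Replaces "dot + whitespace" or a final dot with a newline.
    return re.sub(r'\.\s+|\.$', '\n', text)

result = replace_sentence_dots('I have a meeting at 8.30. I will be at meet.com. Bye.')
print(repr(result))  # 'I have a meeting at 8.30\nI will be at meet.com\nBye\n'

# Without whitespace after the dots, only the final dot is replaced.
no_spaces = replace_sentence_dots('See meet.com.Bye.')
print(repr(no_spaces))  # 'See meet.com.Bye\n'
```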
CodePudding user response:
re.match(r'(?<=\d)\.(?=\d)', text) only tries to match at the very start of text. The lookbehind requires a digit before the period, which is impossible at position 0, so the result is always None and not re.match(r'(?<=\d)\.(?=\d)', text) is always True.
Similarly, re.match(r'\.(com|org|net|co|id)', text) is None unless text starts with something like .com, so the second not re.match(...) is True as well.
You then call text = text.replace(char, "\n"), which operates on the entire text. So even if your condition worked, a single hit would still replace every period in the string, not just the one being examined.
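The two points above can be checked directly. This is a minimal sketch using the question's input; the variable names found and replaced are just for illustration.

```python
import re

text = "I have a meeting at 8.30. I will be at meet.com. Bye."

# re.match only tries position 0, where the lookbehind (?<=\d) cannot succeed.
print(re.match(r'(?<=\d)\.(?=\d)', text))  # None

# re.search scans the whole string and finds the dot inside "8.30".
found = re.search(r'(?<=\d)\.(?=\d)', text)
print(text[found.start() - 1:found.start() + 2])  # 8.3

# str.replace knows nothing about "the current character": it replaces every dot.
replaced = text.replace(".", "\n")
print("." in replaced)  # False
```

Note that even with re.search, the condition would only say whether such a dot exists somewhere in the text, not whether the dot currently being looked at is one of them.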
If you want to replace every period that is not followed by any of com|org|net|co|id and also not followed by \d (you do want to replace the period after 8.30, and you probably also don't want to replace the . in something like '$.30'), this works:
def replace_dot_for_original_sentence(text):
    return re.sub(r"(?s)\.(?!\d)(?!com|org|net|co|id)", "\n", text)
The for loop doesn't accomplish anything. In a very roundabout way, it just ensures the replacement only runs if there's a period in the string.
Note that the expression still needs some work. For example, because you have co in there, .coupons, .cooking and .courses (to name but a few) are now matched and skipped as well, while something like .co.uk still gets cut off in the middle.
If it works for your data set, great. But don't view it as even a halfway decent way to detect the end of URLs.
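To illustrate those two caveats, here is a short sketch that runs the same pattern on two made-up inputs. The function name replace_dot and the sample strings are only for illustration.

```python
import re

def replace_dot(text):
    # Same pattern as in the function above.
    return re.sub(r"(?s)\.(?!\d)(?!com|org|net|co|id)", "\n", text)

# ".coupons" starts with "co", so that dot is skipped.
coupons = replace_dot("Buy at shop.coupons. Bye.")
print(repr(coupons))  # 'Buy at shop.coupons\n Bye\n'

# ".co" is skipped, but the dot before "uk" is replaced, splitting the domain.
co_uk = replace_dot("See example.co.uk. Bye.")
print(repr(co_uk))  # 'See example.co\nuk\n Bye\n'
```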
CodePudding user response:
You can put the two conditions inside a single negative lookahead like this:
\.(?!(?:com?|org|net|id)\b|(?<=\d\.)\d)
Nothing prevents you from putting a lookbehind inside the lookahead to check the digit condition.
def replace_dot_for_original_sentence(text):
    return re.sub(r'\.(?!(?:com?|org|net|id)\b|(?<=\d\.)\d)', "\n", text)
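As a quick check, this sketch runs the function on the input from the question and on one extra made-up string. The second input ('It costs $.30') is an assumption added only to show that a dot before a digit is kept only when a digit also precedes it.

```python
import re

def replace_dot_for_original_sentence(text):
    return re.sub(r'\.(?!(?:com?|org|net|id)\b|(?<=\d\.)\d)', "\n", text)

sample = "I have a meeting at 8.30. I will be at meet.com. Bye."
result = replace_dot_for_original_sentence(sample)
print(repr(result))  # 'I have a meeting at 8.30\n I will be at meet.com\n Bye\n'

# No digit before the dot, so the lookbehind fails and the dot is replaced.
price = replace_dot_for_original_sentence("It costs $.30")
print(repr(price))  # 'It costs $\n30'
```

The space after each replaced dot is kept. If that is not wanted, one option is to add \s* after the lookahead so trailing whitespace is consumed as part of the match.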