I managed to split a sentence on its punctuation. For instance:
import re
sentence = 'i was born in germany (near Frankfurt, in the center of the country) but i live in france. what about you? i know you have a similar story.'
print(list(filter(None, re.split(r'[!(),.:;?]+', sentence))))
which returns
['i was born in germany ', 'near Frankfurt', ' in the center of the country', ' but i live in france', ' what about you', ' i know you have a similar story']
However, I don't know how to handle some special cases of punctuation, for example:
sentence_1 = 'abc.io is a company that employs 10,000 people, half of them in greece.'
With my method I get:
['abc', 'io is a company that employs 10', '000 people', ' half of them in greece']
but I would like to obtain:
['abc.io is a company that employs 10,000 people', ' half of them in greece']
How can I handle this situation (and similar ones)?
CodePudding user response:
We can try splitting on [!(),.:;?]+(?!\S):
import re
sentence_1 = 'abc.io is a company that employs 10,000 people, half of them in greece.'
matches = re.split(r'[!(),.:;?]+(?!\S)', sentence_1)
matches = [x for x in matches if x != '']
print(matches)
# ['abc.io is a company that employs 10,000 people', ' half of them in greece']
This answer assumes that a split should only occur when punctuation is followed by whitespace or the end of the input. We filter out any empty strings that the split produces, such as the one after the final period.
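As a rough sketch of what that assumption implies, the same pattern can be wrapped in a small helper and run on the first sentence from the question. The function name split_before_whitespace is only an illustrative choice, not something from the answer.

```python
import re

def split_before_whitespace(text):
    # Split on runs of punctuation that are followed by whitespace
    # or by the end of the string, then drop empty pieces.
    parts = re.split(r'[!(),.:;?]+(?!\S)', text)
    return [p for p in parts if p]

sentence = 'i was born in germany (near Frankfurt, in the center of the country) but i live in france. what about you? i know you have a similar story.'
print(split_before_whitespace(sentence))
```

Note that the opening parenthesis in "(near" is directly followed by a letter, so under this rule it does not trigger a split and the first piece comes out as 'i was born in germany (near Frankfurt'.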
CodePudding user response:
You can use
re.split(r'(?:,(?!(?<=\d.)\d)|(?!\b\.\b)\.|[!():;?])+', text)
The pattern matches:
(?: - start of a non-capturing group
,(?!(?<=\d.)\d) - a comma not between digits
| - or
(?!\b\.\b)\. - a dot that is not enclosed with word chars
| - or
[!():;?] - a char from the set
)+ - end of the group, one or more times
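For readability, the same pattern could be written with re.VERBOSE so that each alternative carries its own comment. This is only a sketch; the names SPLIT_RE and split_sentence are made up for illustration.

```python
import re

SPLIT_RE = re.compile(
    r"""
    (?:
        ,(?!(?<=\d.)\d)   # a comma that is not between two digits
      | (?!\b\.\b)\.      # a dot that is not enclosed by word characters
      | [!():;?]          # any other punctuation character from the set
    )+                    # one or more such characters in a row
    """,
    re.VERBOSE,
)

def split_sentence(text):
    # Split on the pattern and drop the empty strings it leaves behind.
    return [p for p in SPLIT_RE.split(text) if p]

print(split_sentence('abc.io is a company that employs 10,000 people, half of them in greece.'))
# ['abc.io is a company that employs 10,000 people', ' half of them in greece']
```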
CodePudding user response:
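You could require whitespace after a punctuation character:
print(list(filter(None, re.split(r'[!(),.:;?]\s+', sentence_1))))
which gives ['abc.io is a company that employs 10,000 people', 'half of them in greece.'], close to the desired result (the final period stays because no whitespace follows it).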
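If the trailing punctuation should go as well, one possible follow-up is to strip it from each piece after splitting. This is a sketch rather than part of the answer; PUNCT and split_on_spaced_punct are illustrative names.

```python
import re

PUNCT = '!(),.:;?'

def split_on_spaced_punct(text):
    # Split wherever a punctuation character is followed by whitespace.
    parts = re.split(r'[!(),.:;?]\s+', text)
    # Punctuation at the very end has no whitespace after it,
    # so strip it from each piece and drop pieces that end up empty.
    stripped = (p.rstrip(PUNCT) for p in parts)
    return [p for p in stripped if p]

print(split_on_spaced_punct('abc.io is a company that employs 10,000 people, half of them in greece.'))
# ['abc.io is a company that employs 10,000 people', 'half of them in greece']
```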
CodePudding user response:
You could split on every punctuation character except the dot and comma, and split on a dot or comma only when it is followed by whitespace or the end of the string.
[!():;?]|[.,](?!\S)
Then you can filter the result for empty entries.
import re
strings = [
"i was born in germany (near Frankfurt, in the center of the country) but i live in france. what about you? i know you have a similar story.",
"abc.io is a company that employs 10,000 people, half of them in greece."
]
pattern = r"[!():;?]|[.,](?!\S)"
for s in strings:
    # Drop the empty strings left by adjacent or trailing punctuation.
    print([res for res in re.split(pattern, s) if res])
Output
['i was born in germany ', 'near Frankfurt', ' in the center of the country', ' but i live in france', ' what about you', ' i know you have a similar story']
['abc.io is a company that employs 10,000 people', ' half of them in greece']
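The pieces above keep their leading and trailing spaces. If those are unwanted, a small wrapper could strip them; the sketch below assumes that is the goal, and split_clean is a hypothetical helper name.

```python
import re

PATTERN = re.compile(r"[!():;?]|[.,](?!\S)")

def split_clean(text):
    # Split on the pattern, trim surrounding whitespace from each piece,
    # and drop pieces that are empty after trimming.
    pieces = (p.strip() for p in PATTERN.split(text))
    return [p for p in pieces if p]

print(split_clean("abc.io is a company that employs 10,000 people, half of them in greece."))
# ['abc.io is a company that employs 10,000 people', 'half of them in greece']
```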