Handling of special cases of punctuation when creating tokens with regex

I managed to split a sentence into pieces on the punctuation it contains. For instance:

import re
sentence = 'i was born in germany (near Frankfurt, in the center of the country) but i live in france. what about you? i know you have a similar story.'
print(list(filter(None, re.split(r'[!(),.:;?]+', sentence))))

which returns

['i was born in germany ', 'near Frankfurt', ' in the center of the country', ' but i live in france', ' what about you', ' i know you have a similar story']

Now I don't know how to handle some special cases of punctuation, for example:

sentence_1 = 'abc.io is a company that employs 10,000 people, half of them in greece.'

With my method I get:

['abc', 'io is a company that employs 10', '000 people', ' half of them in greece']

but I would like to get:

['abc.io is a company that employs 10,000 people', ' half of them in greece']

How can I handle this situation (and similar ones)?

CodePudding user response：

We can try splitting on [!(),.:;?]+(?!\S):

import re

sentence_1 = 'abc.io is a company that employs 10,000 people, half of them in greece.'
matches = re.split(r'[!(),.:;?]+(?!\S)', sentence_1)
matches = [x for x in matches if x != '']
print(matches)

# ['abc.io is a company that employs 10,000 people', ' half of them in greece']

This answer assumes that a split should only occur when the punctuation is followed by whitespace or by the end of the input. We then filter out any empty strings the split produces.
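To make the effect of the lookahead visible, here is a small side-by-side sketch. It compares a pattern that splits on every run of punctuation with the guarded one from this answer. The helper name split_tokens is only an illustration and not part of the answer itself.

```python
import re

sentence_1 = 'abc.io is a company that employs 10,000 people, half of them in greece.'

naive = r'[!(),.:;?]+'          # splits on every run of punctuation
guarded = r'[!(),.:;?]+(?!\S)'  # splits only if whitespace or end of input follows


def split_tokens(pattern, text):
    """Split text on pattern and drop empty strings (hypothetical helper)."""
    return [part for part in re.split(pattern, text) if part]


naive_tokens = split_tokens(naive, sentence_1)
guarded_tokens = split_tokens(guarded, sentence_1)

print(naive_tokens)    # 'abc.io' and '10,000' are broken apart
print(guarded_tokens)  # 'abc.io' and '10,000' stay intact
```

Because the lookahead does not consume anything, the space after the comma stays at the start of the next token.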

CodePudding user response：

You can use

re.split(r'(?:,(?!(?<=\d.)\d)|(?!\b\.\b)\.|[!():;?])+', text)

The pattern matches:

- (?: - start of a non-capturing group
- ,(?!(?<=\d.)\d) - a comma that is not between two digits
- | - or
- (?!\b\.\b)\. - a dot that is not enclosed by word characters
- | - or
- [!():;?] - a character from the set
- )+ - end of the group, repeated one or more times
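For readability, the same pattern can be written with re.VERBOSE so that each alternative carries its own comment. This is only a sketch: the names PUNCT_SPLIT and split_tokens are made up here, and empty strings are filtered out as in the other answers.

```python
import re

# Same alternatives as above, spread over several lines (hypothetical name).
PUNCT_SPLIT = re.compile(r'''
    (?:
        ,(?!(?<=\d.)\d)   # a comma, unless it sits between two digits
      | (?!\b\.\b)\.      # a dot, unless word characters surround it
      | [!():;?]          # any other punctuation mark from the set
    )+                    # one or more of these in a row
''', re.VERBOSE)


def split_tokens(text):
    """Split on PUNCT_SPLIT and drop empty strings (hypothetical helper)."""
    return [part for part in PUNCT_SPLIT.split(text) if part]


sentence = ('i was born in germany (near Frankfurt, in the center of the country) '
            'but i live in france. what about you? i know you have a similar story.')
sentence_1 = 'abc.io is a company that employs 10,000 people, half of them in greece.'

tokens = split_tokens(sentence)
tokens_1 = split_tokens(sentence_1)
print(tokens)
print(tokens_1)
```

Unlike a plain "followed by whitespace" check, this version still splits on an opening parenthesis that is directly followed by a word.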

CodePudding user response：

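You could require whitespace after a punctuation character:

print(list(filter(None, re.split(r'[!(),.:;?]\s+', sentence_1))))

which splits sentence_1 at the comma as desired. Note that the whitespace is consumed by the split, and the final period stays on the last token because no whitespace follows it.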
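One possible variation, which is an assumption on my part and not something the question asks for, is to accept either whitespace or the end of the input after the punctuation. That way a trailing punctuation mark is removed as well. The variable names below are illustrative.

```python
import re

sentence_1 = 'abc.io is a company that employs 10,000 people, half of them in greece.'

# Punctuation followed by whitespace only.
with_ws = [p for p in re.split(r'[!(),.:;?]\s+', sentence_1) if p]

# Punctuation followed by whitespace OR by the end of the input.
with_end = [p for p in re.split(r'[!(),.:;?]+(?:\s+|$)', sentence_1) if p]

print(with_ws)
print(with_end)
```

Both versions leave an opening parenthesis alone when a word follows it directly, as in '(near Frankfurt', so they suit sentence_1 better than the first example sentence.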

CodePudding user response：

You could always split on the punctuation characters other than the dot and comma, and split on a dot or comma only when it is followed by a whitespace boundary (that is, by whitespace or the end of the string).

[!():;?]|[.,](?!\S)

Then you can filter the result for empty entries.

import re

strings = [
    "i was born in germany (near Frankfurt, in the center of the country) but i live in france. what about you? i know you have a similar story.",
    "abc.io is a company that employs 10,000 people, half of them in greece."
]

pattern = r"[!():;?]|[.,](?!\S)"

for s in strings:
    print([res for res in re.split(pattern, s) if res])

Output

['i was born in germany ', 'near Frankfurt', ' in the center of the country', ' but i live in france', ' what about you', ' i know you have a similar story']
['abc.io is a company that employs 10,000 people', ' half of them in greece']
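The tokens above still carry their leading or trailing spaces. If that matters, a small wrapper can strip them. The function name tokenize and the constant PATTERN are assumptions for this sketch, not part of the answer above.

```python
import re

PATTERN = re.compile(r"[!():;?]|[.,](?!\S)")


def tokenize(text):
    """Split on PATTERN, strip surrounding whitespace, drop empty pieces."""
    parts = (part.strip() for part in PATTERN.split(text))
    return [part for part in parts if part]


first = tokenize(
    "i was born in germany (near Frankfurt, in the center of the country) "
    "but i live in france. what about you? i know you have a similar story."
)
second = tokenize("abc.io is a company that employs 10,000 people, half of them in greece.")

print(first)
print(second)
```

Stripping before filtering also discards pieces that consist of whitespace only.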