Handling of special cases of punctuation when creating tokens with regex

Time:10-14

I managed to split a sentence on the punctuation it contains. For instance:

import re
sentence = 'i was born in germany (near Frankfurt, in the center of the country) but i live in france. what about you? i know you have a similar story.'
print(list(filter(None, re.split(r'[!(),.:;?]+', sentence))))

which returns

['i was born in germany ', 'near Frankfurt', ' in the center of the country', ' but i live in france', ' what about you', ' i know you have a similar story']

Now I don't know how to handle some special cases of punctuation, for example:

sentence_1 = 'abc.io is a company that employs 10,000 people, half of them in greece.'

With my method I get:

['abc', 'io is a company that employs 10', '000 people', ' half of them in greece']

But I would like to obtain:

['abc.io is a company that employs 10,000 people', ' half of them in greece']

How can I handle this situation (and similar ones)?

CodePudding user response:

We can try splitting on [!(),.:;?]+(?!\S):

import re

sentence_1 = 'abc.io is a company that employs 10,000 people, half of them in greece.'
matches = re.split(r'[!(),.:;?]+(?!\S)', sentence_1)
matches = [x for x in matches if x != '']
print(matches)

# ['abc.io is a company that employs 10,000 people', ' half of them in greece']

This answer assumes that a split should only occur when the punctuation is followed by whitespace or by the end of the input. We filter out the empty strings that the split can produce, for example after the final period.
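As a rough sketch of that assumption, the helper below wraps the split and the filtering so it can be tried on both sentences from the question. The names `PUNCT_BEFORE_SPACE` and `split_on_punctuation` are hypothetical and only used for illustration.

```python
import re

# One or more punctuation characters, matched only when the next
# character is whitespace or the end of the input.
PUNCT_BEFORE_SPACE = re.compile(r'[!(),.:;?]+(?!\S)')

def split_on_punctuation(text):
    """Split on punctuation followed by whitespace or end of input, dropping empty pieces."""
    return [part for part in PUNCT_BEFORE_SPACE.split(text) if part]

sentence = ('i was born in germany (near Frankfurt, in the center of the country) '
            'but i live in france. what about you? i know you have a similar story.')
sentence_1 = 'abc.io is a company that employs 10,000 people, half of them in greece.'

print(split_on_punctuation(sentence_1))
# ['abc.io is a company that employs 10,000 people', ' half of them in greece']
print(split_on_punctuation(sentence))
```

One consequence worth noting: an opening parenthesis is usually preceded by a space rather than followed by one, so with this pattern '(near Frankfurt' stays attached to the text before it in the first sentence.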

CodePudding user response:

You can use

re.split(r'(?:,(?!(?<=\d.)\d)|(?!\b\.\b)\.|[!():;?])+', text)

The pattern matches:

  • (?: - start of a non-capturing group
    • ,(?!(?<=\d.)\d) - a comma not between digits
    • | - or
    • (?!\b\.\b)\. - a dot that is not enclosed with word chars
    • | - or
    • [!():;?] - a char from the set
  • )+ - end of the group, repeated one or more times
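The breakdown above can be checked with a short script. This is only a sketch: it assumes the group is repeated one or more times, as the last bullet says, and it writes the pattern with re.VERBOSE so each alternative carries its own comment. The names `SPLIT_PATTERN` and `split_sentence` are hypothetical.

```python
import re

# The same three alternatives, spelled out with re.VERBOSE.
SPLIT_PATTERN = re.compile(r'''
    (?:
        ,(?!(?<=\d.)\d)    # a comma, unless it sits between two digits
      | (?!\b\.\b)\.       # a dot, unless it has word characters on both sides
      | [!():;?]           # any other punctuation character from the set
    )+                     # one or more of these in a row
''', re.VERBOSE)

def split_sentence(text):
    """Split text with SPLIT_PATTERN and drop empty pieces."""
    return [part for part in SPLIT_PATTERN.split(text) if part]

sentence_1 = 'abc.io is a company that employs 10,000 people, half of them in greece.'
print(split_sentence(sentence_1))
# ['abc.io is a company that employs 10,000 people', ' half of them in greece']
```

Unlike a pattern that requires whitespace after the punctuation, this one still splits on an opening parenthesis, because parentheses are in the unconditional character set.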

CodePudding user response:

You could demand a space after a special character:

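print(list(filter(None, re.split(r'[!(),.:;?]\s+', sentence_1))))

which gives nearly the desired result: the whitespace after the punctuation is consumed by the split, and the final period is kept because no whitespace follows it.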
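To see exactly what this approach returns, here is a small sketch. It assumes the intended quantifier is `\s+` (one or more whitespace characters after the punctuation mark), and the `rstrip` step is an addition for illustration, not part of the answer above.

```python
import re

sentence_1 = 'abc.io is a company that employs 10,000 people, half of them in greece.'

# Split only where a punctuation character is followed by whitespace.
# The whitespace is part of the match, so it is removed from the pieces.
parts = list(filter(None, re.split(r'[!(),.:;?]\s+', sentence_1)))
print(parts)
# ['abc.io is a company that employs 10,000 people', 'half of them in greece.']

# The final period has no whitespace after it, so it is not matched;
# strip trailing punctuation separately if it is unwanted.
cleaned = [p.rstrip('!(),.:;?') for p in parts]
print(cleaned)
# ['abc.io is a company that employs 10,000 people', 'half of them in greece']
```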

CodePudding user response:

You could split on any punctuation character other than the dot and comma, and on a dot or comma only when it is followed by whitespace or the end of the string.

[!():;?]|[.,](?!\S)

Then you can filter the result for empty entries.

import re

strings = [
    "i was born in germany (near Frankfurt, in the center of the country) but i live in france. what about you? i know you have a similar story.",
    "abc.io is a company that employs 10,000 people, half of them in greece."
]

pattern = r"[!():;?]|[.,](?!\S)"

for s in strings:
    print([res for res in re.split(pattern, s) if res])

Output

['i was born in germany ', 'near Frankfurt', ' in the center of the country', ' but i live in france', ' what about you', ' i know you have a similar story']
['abc.io is a company that employs 10,000 people', ' half of them in greece']