I have a function:

with open(filename,'r') as text:
   data=text.readlines()
   split=str(data).split('([.|?])')
   for line in split:
      print(line)

This prints the sentences obtained by splitting a text on two different marks. I also want the split symbol to appear in the output, which is why I use (), but the split does not work.

It returns:

['Chapter 16. My new goal. \n','Chapter 17. My new goal 2. \n']

As you can see, the text was not split on the dots.

CodePudding user response：

Try escaping the marks, as both symbols have special meanings in regex. Also, str.split treats its argument as a literal string, not a regex, so use split from Python's re module instead.

[\.\?]

CodePudding user response：

Use re.split

import re

data = 'Chapter 16. My new goal? Chapter 17. My new goal 2'
# Inside [...] no | is needed; [.?] matches either a dot or a question mark.
parts = re.split(r'[.?]', data)
print(parts)

which splits on each . or ? to produce

['Chapter 16', ' My new goal', ' Chapter 17', ' My new goal 2']

CodePudding user response：

There are a few distinct problems here.

1. read vs readlines

    data = text.readlines()

This produces a list of str, good.

... str(data) ...

If you print this, you will see it contains several characters you likely did not want: [, ', ,, ], plus a literal \n escape in place of each newline.

You'd be better off with just data = text.read().
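To see the difference, here is a small sketch. It uses io.StringIO as a stand-in for the opened file (an assumption made only so the example is self-contained), with sample text taken from the question's output:

```python
import io

sample = "Chapter 16. My new goal. \nChapter 17. My new goal 2. \n"

# io.StringIO stands in for the opened file here.
lines = io.StringIO(sample).readlines()

# str() of a list is its repr: brackets, quotes, commas and escaped newlines.
as_repr = str(lines)

# read() returns the file contents as one plain string.
as_text = io.StringIO(sample).read()

print(as_repr)
print(as_text)
```

The first print shows the list punctuation mixed into the text; the second shows the text unchanged.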

2. split on str vs regex

str(data).split('([.|?])')

We are splitting on a string, ok. Let's consult the fine documentation for str.split:

Return a list of the words in the string, using sep as the delimiter string.

Notice there's no mention of a regular expression. The argument is taken literally, and that sequence of seven characters, ([.|?]), does not appear anywhere in the source string, so nothing is split. You were looking for a similar function:

https://docs.python.org/3/library/re.html#re.split
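A minimal sketch of the difference, using a made-up sample string. The capturing group in the re.split pattern is what keeps each delimiter in the result:

```python
import re

data = "Chapter 16. My new goal? Chapter 17."

# str.split looks for the literal text "([.|?])", which is not in data,
# so the whole string comes back as a single element.
literal = data.split("([.|?])")

# re.split treats the pattern as a regex; because of the capturing group,
# each delimiter is returned as its own list element.
parts = re.split(r"([.?])", data)

print(literal)
print(parts)
```

Note that re.split leaves an empty string at the end when the text ends with a delimiter.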

3. char class vs alternation

We can certainly use | vertical bar for alternation, e.g. r"(cat|dog)".

It works for shorter strings, too, such as r"(c|d)". But for single characters, a character class is more convenient: r"[cd]".

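It is possible to match any one of three characters, one of them being the vertical bar, with r"[c|d]" or equivalently r"[cd|]". That is what the [.|?] in your pattern says: the | is a literal member of the class, not alternation.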

A character class can even have just a single character, so r"[c]" is identical to r"c".
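The following sketch illustrates these cases on a short made-up string that contains a literal vertical bar:

```python
import re

text = "c|d"

# Alternation and a character class find the same two letters ...
alternation = re.findall(r"(c|d)", text)
char_class = re.findall(r"[cd]", text)

# ... but inside a class the bar is a literal character and matches too.
with_bar = re.findall(r"[c|d]", text)

# So a class like [.|?] also splits on a literal bar.
split_on_bar = re.split(r"[.|?]", "a|b.c")

print(alternation, char_class, with_bar, split_on_bar)
```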

4. escaping

Since an unescaped dot matches any character (r".*" matches a whole line), there are certainly cases where escaping the dot is important, e.g. r"(cat|dog|\.)".

We can construct a character class with escaping: r"[cd\.]".

Within [ ] square brackets that \ backwhack is optional. Better to simply say r"[cd.]", which means the same thing.

    pattern = re.compile(r"[.?]")
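As a quick check of the escaping rules, this sketch compares an escaped and an unescaped dot, both outside and inside a character class (the sample strings are made up):

```python
import re

# Outside a class, an unescaped dot matches any character ...
any_char = re.findall(r".", "a.b")

# ... while an escaped dot matches only a literal dot.
literal_dot = re.findall(r"\.", "a.b")

# Inside a class the backslash makes no difference for the dot.
escaped = re.split(r"[cd\.]", "xcy.z")
plain = re.split(r"[cd.]", "xcy.z")

print(any_char, literal_dot, escaped, plain)
```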

5. findall vs split

The two functions are fairly similar.

But findall() is about retrieving matching elements, which your "preserve the final punctuation" requirement asks for, while split() pretty much assumes that the separator is uninteresting. So findall() seems a better match for your use case.

    pattern = re.compile(r"[^.?]+[.?]")

Note that ^ caret usually means "anchor to start of string", but within a character class it is negation. So e.g. r"[^0-9]" means "non-digit".
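Here is a sketch comparing the two functions on a made-up string. It assumes the intended pattern is "one or more characters that are not . or ?, followed by one . or ?", written r"[^.?]+[.?]":

```python
import re

data = "Chapter 16. My new goal? Chapter 17."

# Inside [...] a leading ^ negates the class: [^0-9] means "any non-digit".
non_digits = re.findall(r"[^0-9]", "a1b2")

# split() discards the separators ...
pieces = re.split(r"[.?]", data)

# ... while findall() returns each whole match, punctuation included.
sentences = re.findall(r"[^.?]+[.?]", data)

print(non_digits)
print(pieces)
print(sentences)
```

The matches keep their leading whitespace, so calling strip() on each one may be useful.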

Putting it all together, instead of this:

    data = text.readlines()
    split = str(data).split('([.|?])')

try this:

    import re

    data = text.read()
    # One or more characters that are not . or ?, followed by one . or ?
    pattern = re.compile(r"[^.?]+[.?]")
    sentences = pattern.findall(data)

If there's no trailing punctuation in the source string, the final words won't appear in the result. Consider tacking on a "." period in that case.
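One way to handle that case is sketched below. The helper name split_sentences is hypothetical, and the pattern r"[^.?]+[.?]" is assumed to be the one intended above:

```python
import re

SENTENCE = re.compile(r"[^.?]+[.?]")


def split_sentences(text):
    """Hypothetical helper: return sentences with their end punctuation."""
    text = text.strip()
    # Append a period when the text does not already end with . or ?
    if text and text[-1] not in ".?":
        text += "."
    # Each match keeps its leading whitespace, so strip it off.
    return [s.strip() for s in SENTENCE.findall(text)]


# Without the fix, the unterminated final words are dropped.
without_fix = SENTENCE.findall("First. Second")
with_fix = split_sentences("First. Second")

print(without_fix)
print(with_fix)
```

Since a negated class also matches newlines, the same helper works on text read from a file with read().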