Remove unfinished sentence at the end of a block of text

Here is a given text:

The universe is so big that it’s impossible to comprehend. Our solar system is just a tiny speck in the vastness of space. And our galaxy, the Milky Way, is just one of billions of galaxies in the universe.

The universe is thought to be around 14 billion years old. That’s a lot of time for stars and galaxies to form and change. In fact, the universe is still expanding and will continue to do so forever.

There are billions of galaxies out there, each with billions of stars. And our own Milky Way galaxy is just one of them. In fact, there are so many galaxies that we can't even count them all. And each one of those galaxies is huge. Just trying to wrap your head around the size of one galaxy can be difficult, let alone the size of the entire universe.

But even though the universe is incredibly large, it's also expanding. So it's actually getting bigger every day. Scientists believe that the universe started with a big bang 13.8 billion years ago and has been expanding ever since. That means that the universe has been getting bigger for a very long time and

The problem:

From time to time, the text ends with a broken sentence (always the last one), such as "That means that the universe has been getting bigger for a very long time and". I need to detect when my block of text has been truncated like this and remove the truncated sentence entirely before outputting the text. In short, any ending of the text that does not finish with one of the options below should be removed.

A sentence may finish by

-A dot(.)

-Exclamation mark(!)

-Question mark (?)

-Colon (:)

Note: I am not sure whether this causes issues, but there are blank lines between the paragraphs.

I am wondering if using this could be a good start:

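<?php
// str_ends_with() takes a single string needle, not an array,
// so the last character is checked against the list instead.
if (!in_array(substr(rtrim($text), -1), ['.','!','?',':'], true)) {
    //If the string does not end with one of the above options
    //This is where I struggle....what to do? How to count all the 
    //characters after the last ['.','!','?',':']
}
?>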

Any idea how to do this please?

CodePudding user response：

You can never do it with 100% certainty because sentences can contain d.o.t.s in the middle. But usually there's a space after the stopper, so this might give us a good enough chance in parsing most of the sentences.

$text = "But even though the universe is incredibly large, it's also expanding.
So it's actually getting bigger every day. Scientists believe that the universe started with a
big bang 13.8 billion years ago and has been expanding ever since. That means that the universe has 
been getting n.y.p.d bigger for an ip address of 12.32.43.21 very long time and";

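$text = trim($text) . " ";
// Split on a stopper followed by a space; a non-empty last chunk means the text is truncated.
$arr = preg_split('/[\.!\?:] /', $text);
if ($arr[count($arr) - 1] !== '') {
    $search = "!.?:";
    $max = 0;
    // Find the right-most "stopper + space" pair.
    for ($i = 0; $i < strlen($search); $i++) {
        $sign = $search[$i] . " ";
        $pos = strrpos($text, $sign);
        if ($pos > $max) {
            $max = $pos;
        }
    }
    // Keep everything up to and including that stopper.
    $text = substr($text, 0, $max + 1);
    echo ($text);
}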

Output:

But even though the universe is incredibly large, it's also expanding. So it's actually getting bigger every day. Scientists believe that the universe started with a big bang 13.8 billion years ago and has been expanding ever since.
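The same idea can be sketched as a reusable function. This is only a minimal sketch under a few assumptions: the name trimToLastStopper is hypothetical, a newline after a stopper is treated like a space (because of the blank lines between paragraphs), and a text with no finished sentence at all is returned unchanged rather than cut down.

```php
<?php
// Hypothetical helper; the name and the newline handling are assumptions.
function trimToLastStopper(string $text): string
{
    $text = trim($text);
    if ($text === '') {
        return $text;
    }
    // The text already ends with a stopper: nothing to remove.
    if (strpos('.!?:', $text[strlen($text) - 1]) !== false) {
        return $text;
    }
    $max = -1;
    foreach (['.', '!', '?', ':'] as $stopper) {
        foreach ([' ', "\n"] as $gap) {
            // Right-most stopper that is followed by a space or a newline.
            $pos = strrpos($text, $stopper . $gap);
            if ($pos !== false && $pos > $max) {
                $max = $pos;
            }
        }
    }
    // No finished sentence found: leave the text as it is.
    return $max === -1 ? $text : substr($text, 0, $max + 1);
}

echo trimToLastStopper("Since then. The ip 12.32.43.21 is very long and"), "\n";
// prints "Since then."
```

Because the dots inside 12.32.43.21 are not followed by whitespace, they are not mistaken for sentence endings.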

CodePudding user response：

I would not want this task to hit my desk at work. This will be doomed to fail unexpectedly over the long term. The English language is too complex to parse with mere regex. For now, just match everything up to the last occurring whitelisted character, then release the matched characters with \K so that they are kept. Then replace everything after that point, to the end of the string, with an empty string.

When the last sentence finishes with a qualifying punctuation mark, then no replacement action is done.

Code:

echo preg_replace(
         '/.*[.!?:]\K[^.!?:]+$/s',
         '',
         $text
     );

To clarify a fringe case: this pattern will not reduce the text to an empty string when there are no punctuation marks in the string. Instead, it requires at least one punctuation mark and only removes an unfinished sentence that comes AFTER a finished sentence. I think you are unlikely to encounter this case in your actual project, but it is something I should mention explicitly.
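The three cases (truncated text, complete text, no punctuation at all) can be checked with a small sketch. It assumes the pattern is /.*[.!?:]\K[^.!?:]+$/s, with a + quantifier before the $; the wrapper name removeUnfinishedTail is hypothetical and only there to make the calls readable.

```php
<?php
// Hypothetical wrapper name; only preg_replace() and the pattern come from the answer.
function removeUnfinishedTail(string $text): string
{
    // .*[.!?:]   greedily consume everything up to the last stopper
    // \K         reset the match start, so the consumed part is kept
    // [^.!?:]+$  the stopper-free tail, which is replaced by ''
    return preg_replace('/.*[.!?:]\K[^.!?:]+$/s', '', $text) ?? $text;
}

// Truncated last sentence: the tail is removed.
echo removeUnfinishedTail("It is expanding. That means it has been getting bigger and"), "\n";
// prints "It is expanding."

// Complete text: no match, so nothing changes.
echo removeUnfinishedTail("It is expanding. It is big."), "\n";
// prints "It is expanding. It is big."

// No stopper anywhere: the pattern cannot match, so the text survives.
echo removeUnfinishedTail("no punctuation at all"), "\n";
// prints "no punctuation at all"
```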

If you wanted to remove a trailing substring containing no punctuation marks regardless of what precedes it, the pattern is /[^.!?:]+$/. This will destroy texts that contain no punctuation at all, but it will still work as desired on your original sample string.
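For comparison, here is a minimal sketch of the shorter pattern, again assuming it reads /[^.!?:]+$/ with a + quantifier. The function name stripStopperFreeTail is hypothetical.

```php
<?php
// Hypothetical wrapper around the shorter pattern (assumed to end in "+$").
function stripStopperFreeTail(string $text): string
{
    // Remove the run of stopper-free characters at the very end, if any.
    return preg_replace('/[^.!?:]+$/', '', $text) ?? $text;
}

// A truncated last sentence is removed as before.
echo stripStopperFreeTail("It is expanding. That means it has been getting bigger and"), "\n";
// prints "It is expanding."

// A text without any stopper is wiped out completely.
var_dump(stripStopperFreeTail("no punctuation at all"));
// string(0) ""

// Limitation shared by both regex patterns: a dotted token inside the
// truncated sentence is taken for a sentence ending.
echo stripStopperFreeTail("Done. The ip 12.32.43.21 is long and"), "\n";
// prints "Done. The ip 12.32.43."
```

The last call shows why neither pattern can be fully reliable: any dot, even one inside an IP address or an abbreviation, counts as a stopper.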