Regex for splitting sentences isn't working properly C#

Time:11-10

I am trying to write a regex which splits sentences in C#.

My regex isn't working properly: it splits the text in the right places, but the last character of each sentence is always removed. Any tips?

For example, say I want to split this text into sentences:

Lorem ipsum dolor sit amet. Nam autem doloribus ut perspiciatis omnis est ratione quidem!

My regex splits them into:

Lorem ipsum dolor sit ame

Nam autem doloribus ut perspiciatis omnis est ratione quide

It should be:

Lorem ipsum dolor sit amet

Nam autem doloribus ut perspiciatis omnis est ratione quidem

Sample code

My regex is in the string variable pattern:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using static System.Net.Mime.MediaTypeNames;

namespace L4_17
{
    internal class Program
    {
        static void Main(string[] args)
        {
            const string firstBookData = "first.txt";
            string firstFileData = File.ReadAllText(firstBookData);
            string pattern = "[^\\.\\!\\?] *[\\.\\!\\?]";
            List<string> allSentencesInFirstDataFile = Regex.Split(firstFileData, pattern).ToList();
            foreach (string sentence in allSentencesInFirstDataFile)
            {
                Console.WriteLine(sentence);
            }
        }
    }
}

CodePudding user response:

I suggest using a different pattern:

[.!?]+\s*(?=\p{Lu}|$)

Explanation:

[.!?]+       - one or more of ., ! or ? (so that ??, ..., ?! etc. are supported)
\s*          - zero or more whitespace characters
(?=\p{Lu}|$) - followed by either the capital letter of the next sentence or the end of the string

Code:

var text = "Lorem ipsum dolor sit amet. Nam etc. autem??? Doloribus ut perspiciatis?! Omnis est ratione quidem!";

var lines = Regex.Split(text, @"[.!?]+\s*(?=\p{Lu}|$)");

Console.WriteLine(string.Join(Environment.NewLine, lines));

Output:

Lorem ipsum dolor sit amet
Nam etc. autem               # <- note etc. is not the end of the sentence
Doloribus ut perspiciatis
Omnis est ratione quidem
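
As for why the pattern in the question drops a character: its leading [^\.\!\?] matches the letter just before the punctuation, and Regex.Split discards everything the pattern matches, so that letter disappears along with the full stop. The lookahead in the suggested pattern only inspects the next character without consuming it. A minimal sketch comparing the two is below; the class name SentenceSplitter, its members, and the filtering of empty entries are my own illustrative assumptions, not part of the original code.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class SentenceSplitter
{
    // Pattern from the question: [^\.\!\?] consumes the character before
    // the punctuation, so Regex.Split removes it together with the delimiter.
    public const string OriginalPattern = @"[^\.\!\?] *[\.\!\?]";

    // Suggested pattern: only punctuation and trailing whitespace are consumed;
    // the lookahead (?=...) checks what follows without consuming it.
    public const string SuggestedPattern = @"[.!?]+\s*(?=\p{Lu}|$)";

    public static string[] Split(string text, string pattern)
    {
        // Regex.Split returns an empty string after a match at the very end
        // of the input, so empty entries are filtered out here.
        return Regex.Split(text, pattern)
                    .Where(part => part.Length > 0)
                    .ToArray();
    }

    public static void Main()
    {
        string text = "Lorem ipsum dolor sit amet. Nam autem doloribus ut perspiciatis omnis est ratione quidem!";

        Console.WriteLine("Original pattern:");
        foreach (string sentence in Split(text, OriginalPattern))
        {
            Console.WriteLine(sentence);
        }

        Console.WriteLine("Suggested pattern:");
        foreach (string sentence in Split(text, SuggestedPattern))
        {
            Console.WriteLine(sentence);
        }
    }
}
```

With the original pattern the first sentence comes back as "Lorem ipsum dolor sit ame"; with the suggested one it keeps its final "t". Without the empty-entry filter, Regex.Split also returns one empty string at the end whenever the text finishes with a sentence terminator.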