I am trying to write a regex that splits text into sentences in C#.
My regex isn't working properly: it splits the text in the right places, but the last character of each resulting string is always removed. Any tips?
For example, if I want to split this text into sentences:
Lorem ipsum dolor sit amet. Nam autem doloribus ut perspiciatis omnis est ratione quidem!
My regex splits them into:
Lorem ipsum dolor sit ame
Nam autem doloribus ut perspiciatis omnis est ratione quide
It should be:
Lorem ipsum dolor sit amet
Nam autem doloribus ut perspiciatis omnis est ratione quidem
Sample code
My regex is the string variable pattern:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using static System.Net.Mime.MediaTypeNames;

namespace L4_17
{
    internal class Program
    {
        static void Main(string[] args)
        {
            const string firstBookData = "first.txt";
            string firstFileData = File.ReadAllText(firstBookData);
            string pattern = "[^\\.\\!\\?] *[\\.\\!\\?]";
            List<string> allSentencesInFirstDataFile = Regex.Split(firstFileData, pattern).ToList();
            foreach (string sentence in allSentencesInFirstDataFile)
            {
                Console.WriteLine(sentence);
            }
        }
    }
}
CodePudding user response:
I suggest using a different pattern:
[.!?]+\s*(?=\p{Lu}|$)
Explanation:
[.!?]+ - one or more of the symbols ., !, ? (to support ??, ..., ?! etc.)
\s* - zero or more whitespace characters
(?=\p{Lu}|$) - lookahead: either an uppercase letter starting the next sentence or the end of the string
Code:
var text = "Lorem ipsum dolor sit amet. Nam etc. autem??? Doloribus ut perspiciatis?! Omnis est ratione quidem!";
var lines = Regex.Split(text, @"[.!?]+\s*(?=\p{Lu}|$)");
Console.WriteLine(string.Join(Environment.NewLine, lines));
Output:
Lorem ipsum dolor sit amet
Nam etc. autem # <- note etc. is not the end of the sentence
Doloribus ut perspiciatis
Omnis est ratione quidem
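One thing to watch for: when the text ends with a terminator, Regex.Split returns an empty string as the last element, because the pattern also matches at the very end of the input. The sketch below shows one possible way to drop that entry. It also prints what the pattern from the question actually matches, which shows why the last letter went missing: the leading [^.!?] is part of the match, and Split discards everything that is matched. SentenceSplitDemo and SplitSentences are hypothetical names used only for illustration, not part of the original code.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

internal static class SentenceSplitDemo
{
    // Hypothetical helper (an assumption for this sketch): splits on runs of
    // sentence-ending punctuation and drops the empty entries that Regex.Split
    // leaves behind when the text ends with a terminator.
    internal static string[] SplitSentences(string text)
    {
        return Regex.Split(text, @"[.!?]+\s*(?=\p{Lu}|$)")
                    .Where(s => s.Length > 0)
                    .ToArray();
    }

    private static void Main()
    {
        const string text =
            "Lorem ipsum dolor sit amet. Nam autem doloribus ut perspiciatis omnis est ratione quidem!";

        // Pattern from the question: the character class in front consumes the
        // letter before the punctuation, so Split removes it with the delimiter.
        foreach (Match m in Regex.Matches(text, @"[^.!?] *[.!?]"))
        {
            Console.WriteLine($"removed by Split: \"{m.Value}\"");
        }

        // Suggested pattern: only punctuation and trailing whitespace are consumed.
        foreach (string sentence in SplitSentences(text))
        {
            Console.WriteLine(sentence);
        }
    }
}
```

For the sample text, the question's pattern matches "t." and "m!", which is exactly the text that disappeared from the output. Filtering with Where keeps the sketch independent of the StringSplitOptions overloads, which belong to String.Split rather than Regex.Split.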