Match string from ">" up to the last dot

Time:11-17

I have to select only the characters contained from > to the last dot (not the first dot).

I tried this pattern

^>[a-zA-Z]+$

but something doesn't work. Can I get some help? Thank you.

Lorem ipsum dolor sit amet, consectetuer adipiscing elit,
sed diam nonummy nibh euismod tincidunt ut laoreet dolore
magna aliquam erat volutpat.

>Lorem ipsum dolor sit amet, consectetuer adipiscing elit,
sed diam nonummy nibh euismod tincidunt ut laoreet dolore
magna aliquam erat volutpat.
Lorem ipsum dolor sit amet, consectetuer adipiscing elit,
sed diam nonummy nibh euismod tincidunt ut laoreet dolore
magna aliquam erat volutpat.

Lorem ipsum dolor sit amet, consectetuer adipiscing elit,
sed diam nonummy nibh euismod tincidunt ut laoreet dolore
magna aliquam erat volutpat.

CodePudding user response:

I wrote the example in JavaScript to have a working demo. The match has to span line breaks, and the commonly used . (dot) does not match them by default, so I used [\s\S] instead.

The regex ^>[\s\S]+\.\n expects a > at the beginning of a line, followed by any characters (including line breaks) up to the last dot that is immediately followed by a newline.

This demo runs the regex against the full text and returns just the middle part, as you were expecting. Note that the final paragraph of the sample does not end with a newline, which is why the match stops before it:

const subject = `
Lorem ipsum dolor sit amet, consectetuer adipiscing elit,
sed diam nonummy nibh euismod tincidunt ut laoreet dolore
magna aliquam erat volutpat.

>Lorem ipsum dolor sit amet, consectetuer adipiscing elit,
sed diam nonummy nibh euismod tincidunt ut laoreet dolore
magna aliquam erat volutpat.
Lorem ipsum dolor sit amet, consectetuer adipiscing elit,
sed diam nonummy nibh euismod tincidunt ut laoreet dolore
magna aliquam erat volutpat.

Lorem ipsum dolor sit amet, consectetuer adipiscing elit,
sed diam nonummy nibh euismod tincidunt ut laoreet dolore
magna aliquam erat volutpat.`;

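// The m flag makes ^ match at the start of every line, not only at the start of the string.
const re = /^>[\s\S]+\.\n/m;
const match = re.exec(subject);
// exec returns null when nothing matches; fall back to an empty string.
const result = match !== null ? match[0] : "";

console.log(result);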
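
To make the point about the dot and line breaks concrete, here is a minimal sketch comparing three variants on a short made-up input. The sample string and the variable names are assumptions for illustration only; the s (dotAll) flag is an alternative to [\s\S] in engines that support it.

```javascript
// Hypothetical input: a ">" paragraph that spans two lines.
const sample = "intro line.\n>first line, still going\nsecond line ends here.\ntrailing text";

// Without the s flag, "." stops at line breaks. The first line of the
// ">" paragraph contains no dot, so this attempt finds nothing.
const dotOnly = /^>.+\./m.exec(sample);

// [\s\S] matches any character, including "\n", so the greedy match
// runs to the last dot in the input.
const anyChar = /^>[\s\S]+\./m.exec(sample);

// The s (dotAll) flag lets "." match line breaks as well.
const dotAll = /^>.+\./ms.exec(sample);

console.log(dotOnly);     // null
console.log(anyChar[0]);  // the two-line paragraph, ending at "here."
console.log(dotAll[0]);   // same text as anyChar[0]
```

Both working variants are greedy, so they stop at the last dot of the whole input rather than the last dot of the paragraph.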

CodePudding user response:

C# .NET solution

(>[\s\S]*\.)

Or, if you don't want to capture the > and the final ., you can use a positive lookbehind and a positive lookahead.

  • To match every character in between, including line breaks, we can use [\s\S]*. Quantifiers in .NET are greedy by default, so the match runs to the last dot in the input.

(?<=>)([\s\S]*)(?=\.)

Try this fiddle: https://dotnetfiddle.net/3ukM0X

using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        string content = @"Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.

>Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.

Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.";

        // The lookbehind and lookahead keep the leading > and the final dot out of the match.
        var regex = new Regex(@"(?<=>)([\s\S]*)(?=\.)");
        Console.WriteLine(regex.Match(content).Value);
    }
}

Returns:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.

Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat
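
For comparison with the JavaScript answer, the same lookaround pattern can be sketched in JavaScript, assuming an engine that supports lookbehind (current Node.js and modern browsers do). The helper name between and the sample string are hypothetical.

```javascript
// Hypothetical helper: returns the text after the first ">" up to,
// but not including, the last "." in the input.
function between(text) {
  // (?<=>) requires a ">" just before the match; (?=\.) requires a dot just after it.
  const m = /(?<=>)[\s\S]*(?=\.)/.exec(text);
  return m === null ? "" : m[0];
}

console.log(between("skip.\n>keep this. and this.\nand more. tail"));
```

Neither the > nor the final dot is part of the returned text, and an input without a > yields an empty string.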

CodePudding user response:

If there cannot be any > chars inside the paragraph, only at its start, then you can use:

^>[^>]*\.
  • ^ Start of the line (with the multiline flag enabled)
  • > Match literally
  • [^>]* Match zero or more chars other than >, including line breaks
  • \. Backtrack to match the last occurrence of a dot

See a regex demo
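
As a rough illustration of this pattern in JavaScript, the sketch below collects every > block from a made-up input using the g and m flags. The variable names and the sample text are assumptions, not part of the original demo.

```javascript
// Hypothetical input with two ">" blocks and a plain line in between.
const quoted = ">alpha one. alpha two.\nplain text\n>beta one.";

// g finds every block; m lets ^ match at the start of each line.
// [^>]* cannot cross a ">", so each match ends at the last dot before the next ">".
const blocks = [...quoted.matchAll(/^>[^>]*\./gm)].map((m) => m[0]);

console.log(blocks);
```

Any text after the > line counts toward the match until the next > appears, so a following line that ends in a dot would be included as well.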


If you want to allow the > char inside the paragraph (but not at the start of a line, as that denotes the start of a paragraph), you can match all following lines that do not start with >

^>.*(?:\r?\n(?!>).*)*\.

See another regex demo
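
A small JavaScript sketch of this second pattern, again with a made-up input and hypothetical names. It assumes the multiline flag and relies on . not matching line breaks, so each line is consumed explicitly through \r?\n.

```javascript
// Hypothetical input: the first block contains a ">" in the middle of a line.
const mixed = ">a > b.\nsecond line. done\n>next.";

// Each continuation line must not start with ">"; the final \. backtracks
// to the last dot inside the lines collected so far.
const paragraphs = [...mixed.matchAll(/^>.*(?:\r?\n(?!>).*)*\./gm)].map((m) => m[0]);

console.log(paragraphs);
```

The > in the middle of the first line does not end the block; only a > at the start of a line does.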
