Why on earth does this regex pattern only return the last instance?

Time:12-07

I have the following string that I'm trying to perform regex on:

040A0000 02CCDAD0 F9401401
040A0000 02CCDAD4 F8410021
040A0000 02CCDAD8 B4000041
040A0000 02CCDADC 52800015
040A0000 02CCDAE0 2A1503E1
040A0000 02CCDAE4 17DA29B5

My goal is to retrieve the last block of 8 characters, regardless of how many come before it. I am using the following pattern:

^(([\d\w]+ ){1,})?([\d\w]+)$

Now, according to regex101, this pattern should work just fine: https://regex101.com/r/ZuWIPV/1

However, when running the following code:

    var reg = new Regex("^(([\\d\\w]+ ){1,})?([\\d\\w]+)$", RegexOptions.Multiline);
    if (reg.IsMatch(textBox1.Text))
    {
        var instructions = reg.Matches(textBox1.Text).Cast<Match>().Select(x => x.Groups[3].Value).ToArray();
        foreach (var instruction in instructions)
        {
            MessageBox.Show(instruction);
        }
    }

The only result I get is from the very last line:

17DA29B5

I was expecting to get all 6, like this:

F9401401
F8410021
B4000041
52800015
2A1503E1
17DA29B5

CodePudding user response:

First of all, you do not need [\d\w], because \w already matches digits. Next, when you define a regex in C#, use verbatim string literals (@"...") to avoid doubling every backslash. Also, use capturing groups only when you need the captured text; elsewhere, non-capturing groups (?:...) save some performance.

The issue is that you forgot to match an optional CR char at the end of each line. See the Multiline Mode section of the MSDN regex reference:

By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.

You can use

var reg = new Regex(@"^(?:\w+ )*(\w+)\r?$", RegexOptions.Multiline);
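To see the effect of the optional \r, here is a minimal sketch that runs both variants of the pattern against CRLF-separated input. The class name CrlfDemo, the helper LastBlocks, and the shortened three-line sample are made up for illustration; only Regex.Matches and RegexOptions.Multiline come from .NET itself.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CrlfDemo
{
    // Hypothetical helper: returns capture group 1 of every match in Multiline mode.
    public static string[] LastBlocks(string pattern, string text) =>
        Regex.Matches(text, pattern, RegexOptions.Multiline)
             .Cast<Match>()
             .Select(m => m.Groups[1].Value)
             .ToArray();

    public static void Main()
    {
        // Assumed sample: three lines from the question, joined with CRLF
        // the way a Windows TextBox would deliver them.
        string text = "040A0000 02CCDAD0 F9401401\r\n" +
                      "040A0000 02CCDAD4 F8410021\r\n" +
                      "040A0000 02CCDAD8 B4000041";

        // Without \r? only the last line matches, because $ cannot sit before \r.
        string[] withoutCr = LastBlocks(@"^(?:\w+ )*(\w+)$", text);
        // With \r? every line matches, and group 1 still excludes the \r.
        string[] withCr = LastBlocks(@"^(?:\w+ )*(\w+)\r?$", text);

        Console.WriteLine(withoutCr.Length); // 1
        Console.WriteLine(withCr.Length);    // 3
    }
}
```

The captured group never contains the \r because \r? sits outside the parentheses, so the values can be used directly.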

To support any horizontal whitespace between the blocks on a line, you can use

var reg = new Regex(@"^(?:\w+[\p{Zs}\t])*(\w+)\r?$", RegexOptions.Multiline);

where [\p{Zs}\t] matches any horizontal whitespace.

And if you just want to match the last 8 ASCII hex chars at the end of each line you can just use

var reg = new Regex(@"[a-fA-F0-9]{8}\r?$", RegexOptions.Multiline);

Note that in .NET, \w matches all Unicode letters, digits, connector punctuation and even diacritic marks, so it might not be the best bet in this case. Alas, .NET regex has no shorthand for hex chars, like %x in Lua, [[:xdigit:]] in POSIX BRE/ERE, \p{XDigit} in Java, etc.
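As a rough illustration of why the explicit class is stricter than \w, the sketch below checks a few eight-character strings against both. The class and method names (HexClassDemo, IsHexWord, IsWordChars) are invented for this example, and \z is used only to anchor at the very end of the string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HexClassDemo
{
    // True when the whole input is exactly eight ASCII hex digits.
    public static bool IsHexWord(string s) =>
        Regex.IsMatch(s, @"^[a-fA-F0-9]{8}\z");

    // True when the whole input is eight \w characters (Unicode-aware by default in .NET).
    public static bool IsWordChars(string s) =>
        Regex.IsMatch(s, @"^\w{8}\z");

    public static void Main()
    {
        Console.WriteLine(IsHexWord("17DA29B5"));         // True
        Console.WriteLine(IsWordChars("17DA29BZ"));       // True: Z is a word char
        Console.WriteLine(IsHexWord("17DA29BZ"));         // False: Z is not a hex digit
        Console.WriteLine(IsWordChars("17DA29B\u00E9"));  // True: U+00E9 is a Unicode letter
        Console.WriteLine(IsHexWord("17DA29B\u00E9"));    // False
    }
}
```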

Why does regex101.com show correct matches?

At regex101.com, all line breaks are LF only, but in C# on Windows, line endings are mostly CRLF. In Multiline mode, $ matches only right before the LF char, so on a CRLF line the CR sits between the last hex block and the position where $ can match.

When you need to test a .NET regex, it is not a good idea to validate the pattern at regex101.com, as this regex testing site does not support .NET regex syntax and uses Linux (LF-only) line endings. You can use RegexStorm.net, where line breaks are set to CRLF.

CodePudding user response:

If you look at the documentation about multiline mode, it has this:

By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.

Your textbox will have \r\n for its newlines, which is why $ fails on every line but the last. With \n-only line breaks, this works as you'd expect on .NET:

string Text = "040A0000 02CCDAD0 F9401401\n040A0000 02CCDAD4 F8410021\n040A0000 02CCDAD8 B4000041\n040A0000 02CCDADC 52800015\n040A0000 02CCDAE0 2A1503E1\n040A0000 02CCDAE4 17DA29B5\n";

//the regex can be simpler if you just want the last hex chars of a line
var reg = new Regex("([A-F0-9]{8})$", RegexOptions.Multiline);
var rr = reg.Matches(Text); //6 matches

I'd point out that if this data will have this regular presentation all the time, it doesn't even need a regex:

textbox.Lines.Select(line => line[^8..]) 

would also retrieve the last 8 chars from every line, though it wouldn't validate them for being hex (if that's important)
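One caveat with slicing is that a line shorter than 8 characters makes line[^8..] throw. A minimal sketch of a more defensive variant follows, assuming the text arrives as one string rather than through a TextBox; the names LastCharsDemo and LastEight are hypothetical, and skipping short lines is just one possible policy.

```csharp
using System;
using System.Linq;

public static class LastCharsDemo
{
    // Splits on CRLF or LF, skips lines shorter than 8 chars,
    // and returns the last 8 characters of each remaining line.
    // No hex validation is done here.
    public static string[] LastEight(string text) =>
        text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries)
            .Where(line => line.Length >= 8)
            .Select(line => line.Substring(line.Length - 8))
            .ToArray();

    public static void Main()
    {
        string text = "040A0000 02CCDAD0 F9401401\r\n040A0000 02CCDAD4 F8410021\r\n";
        foreach (string block in LastEight(text))
        {
            Console.WriteLine(block); // F9401401, then F8410021
        }
    }
}
```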
