find specific pattern of digits in a string

Time:12-15

Consider the following strings:

"via caporale degli zuavi 278a , 78329" 

and

"autostrada a1 km - 47"

I want to isolate a specific sequence that may be present (first example) or absent (second example).

Specifically, I am looking for a sequence of 1 to 4 digits that may be followed by a single letter. In addition, the string must not contain the substring "km". So in the first example "278a" is valid, but the other digit sequences are not.

Here is what I have done so far.

Since I know that any string containing "km" is not valid, I apply this check first:

if(!stripped.ToLower().Contains("km"))
{
    // apply Regex
}
else
    // string not valid, move on

I know that this regex gives me every sequence of digits: Regex.Matches(t, @"\d+"); but that is not enough. How can I proceed from here?

Edit: for further clarification, when a sequence of digits is followed by a letter, that letter must be the very next character (no whitespace or anything else in between).

Edit2: note that the trailing letter is optional (so 278a is as valid as 278).

CodePudding user response:

You can assert that the word km occurs neither to the left nor to the right, then match 1-4 digits (0-9) optionally followed by a single letter (a-z or A-Z):

(?<!\bkm\b.*)\b[0-9]{1,4}[A-Za-z]?\b(?!.*\bkm)
  • (?<!\bkm\b.*) Assert not km to the left
  • \b[0-9]{1,4}[A-Za-z]?\b Match 1-4 digits 0-9 and optionally a single char A-Za-z, with word boundaries on both sides
  • (?!.*\bkm) Assert not km to the right

.NET Regex demo

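using System;
using System.Text.RegularExpressions;

// The lookbehind and lookahead reject any candidate with the word "km" elsewhere on the same line.
string pattern = @"(?<!\bkm\b.*)\b[0-9]{1,4}[A-Za-z]?\b(?!.*\bkm)";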
string input = @"via caporale degli zuavi 278a , 78329
via caporale degli zuavi 277 , 78329
via caporale degli zuavi 279a , 78329 km
km via caporale degli zuavi 280a , 78329
autostrada a1 km - 47";

foreach (Match m in Regex.Matches(input, pattern))
{
    Console.WriteLine(m.Value);
}

Output

278a
277

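If only one match is expected, you can instead rule out km for the whole string up front and read the value from a capture group with Regex.Match. Because the .* before the group is greedy, group 1 holds the last qualifying sequence in the string: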

^(?!.*\bkm\b).*\b([0-9]{1,4}[A-Za-z]?)\b

Regex demo
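
As a rough sketch of how the single-match pattern might be used with Regex.Match: the helper name ExtractNumber and the choice to return an empty string on failure are assumptions made for illustration, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Pattern from the answer: fail if "km" appears as a whole word,
    // otherwise capture 1-4 digits plus an optional letter in group 1.
    const string Pattern = @"^(?!.*\bkm\b).*\b([0-9]{1,4}[A-Za-z]?)\b";

    // Hypothetical helper: returns the captured value, or "" when nothing matches.
    static string ExtractNumber(string input)
    {
        Match m = Regex.Match(input, Pattern);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    static void Main()
    {
        string[] samples =
        {
            "via caporale degli zuavi 278a , 78329",
            "autostrada a1 km - 47"
        };

        foreach (string s in samples)
        {
            string number = ExtractNumber(s);
            Console.WriteLine(number.Length > 0
                ? $"'{s}' -> {number}"
                : $"'{s}' -> no match");
        }
    }
}
```

For the two strings from the question this should report 278a for the first and no match for the second. Note that the pattern is case-sensitive, so an upper-case "KM" would only be rejected if RegexOptions.IgnoreCase were added.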

CodePudding user response:

You can use

^(?!.*(?<!\p{L})km\b)(?:.*\D)?(\d{1,4})(?=\p{L}?\b)

See the .NET regex demo. Details:

  • ^ - start of string
  • (?!.*(?<!\p{L})km\b) - a negative lookahead that fails the match if the line contains km that is not preceded by a letter and not followed by a letter, digit or underscore
  • (?:.*\D)? - an optional sequence of any zero or more chars other than a newline char, as many as possible, and then a non-digit char
  • (\d{1,4}) - Group 1: one to four digits
  • (?=\p{L}?\b) - immediately on the right, there must be an optional letter that is not followed by any alphanumeric or connector punctuation char (like _). The letter is only checked, not captured.

See a C# demo:

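using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var l = new List<string> {"via caporale degli zuavi 278a , 78329","autostrada a1 km - 47"};
// Group 1 holds the digits only; the optional trailing letter is asserted but not captured.
var rx = @"^(?!.*(?<!\p{L})km\b)(?:.*\D)?(\d{1,4})(?=\p{L}?\b)";
foreach (var t in l) 
{
    // Regex.Match never returns null; on failure Groups[1].Value is an empty string.
    var match = Regex.Match(t, rx, RegexOptions.ECMAScript).Groups[1].Value;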
    if (!string.IsNullOrEmpty(match))
    {
        Console.WriteLine($"There is a match in '{t}': {match}");
    } 
    else
    {
        Console.WriteLine($"There is no match in '{t}'.");
    }
}

Output:

There is a match in 'via caporale degli zuavi 278a , 78329': 278
There is no match in 'autostrada a1 km - 47'.

The RegexOptions.ECMAScript option is used to make \d only match ASCII digits (it does not affect \p{L} though).
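
To illustrate the point about \d, here is a small sketch comparing the default behaviour with RegexOptions.ECMAScript. The test character (U+0663, ARABIC-INDIC DIGIT THREE) is chosen only as an example of a non-ASCII decimal digit.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // U+0663 ARABIC-INDIC DIGIT THREE: a Unicode decimal digit outside ASCII.
        string arabicThree = "\u0663";

        // By default, \d matches any Unicode decimal digit.
        Console.WriteLine(Regex.IsMatch(arabicThree, @"\d"));                          // True

        // With RegexOptions.ECMAScript, \d is restricted to [0-9].
        Console.WriteLine(Regex.IsMatch(arabicThree, @"\d", RegexOptions.ECMAScript)); // False
        Console.WriteLine(Regex.IsMatch("3", @"\d", RegexOptions.ECMAScript));         // True
    }
}
```

If the option is not wanted for other reasons, writing [0-9] instead of \d gives the same ASCII-only restriction.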
