RegEx to match only a single occurrence of a keyword

Time:05-19

I'm having a hard time trying to compose a RegEx to meet my specific requirements.

These are:

  1. Match keyword and capture the date that follows
  2. If keyword is not present capture nothing
  3. If keyword is present more than once, capture nothing

Keyword:

 LT circa

Example Text:

Metall-Notierung 464,95 EUR 100 KG
* LT circa 21.04.2020 2 x 500 M Einwegtrommel 400x 150x 404mm
* LT circa 17.05.2020 2 x 500 M Einwegtrommel 400x 150x 404mm
Zolltarifnummer 80464995

Expected Result:

NULL

Example Text:

Metall-Notierung 464,95 EUR 100 KG
* LT circa 17.05.2020 2 x 500 M Einwegtrommel 400x 150x 404mm
Zolltarifnummer 80464995

Expected Result:

17.05.2020

Being a newbie to RegEx, here is what I have tried so far on a simplified subject:

This test is a test and nothing else

(.*test.*test.*)?(?(1)(a^):(test.*))

...as you might expect, it would be naive to think that this could work.

Experts anyone?

Edit:

I checked using .NET Framework 4.7.2 and NUnit

using NUnit.Framework;
using System.Collections;
using System.Text;
using System.Text.RegularExpressions;

namespace Test.RegExpressions.Tests
{
   [TestFixture]
   public class SpecialRegexTests
   {
      [TestCaseSource(typeof(TestCaseClass), nameof(TestCaseClass.TestCases))]
      public int MatchTest(string input, string pattern, RegexOptions regexOptions)
      {
         return new Regex(pattern, regexOptions).Matches(input).Count;
      }
   }

   public static class TestCaseClass
   {
      private static readonly string S0 = new StringBuilder()
         .AppendLine("Metall - Notierung 464,95 EUR 100 KG")
         .AppendLine("* LT circa 21.04.2020 2 x 500 M Einwegtrommel 400x 150x 404mm")
         .AppendLine("* LT circa 17.05.2020 2 x 500 M Einwegtrommel 400x 150x 404mm")
         .AppendLine("Zolltarifnummer 80464995")
         .ToString();

      private static readonly string S1 = new StringBuilder()
         .AppendLine("Metall - Notierung 464,95 EUR 100 KG")
         .AppendLine("* LxT circa 21.04.2020 2 x 500 M Einwegtrommel 400x 150x 404mm")
         .AppendLine("* LT circa 17.05.2020 2 x 500 M Einwegtrommel 400x 150x 404mm")
         .AppendLine("Zolltarifnummer 80464995")
         .ToString();

      private const string R0 = @"^(?:(?!.*LT circa).+\n)*(?:(?!LT circa).)*LT circa\s+(\d\d\.\d\d.\d{4})(?!(?:.+\n)*.*LT circa)";

      private const string R1 = @"(?s)^(?!(?:.*LT circa){2}).*LT circa\s*\K\d{1,2}\.\d{1,2}\.\d{4}";

      private const string R2 = @"(?s)^(?!(?:.*LT circa){2}).*LT circa\s*(\d{1,2}\.\d{1,2}\.\d{4})";

      public static IEnumerable TestCases
      {
         get
         {
            yield return new TestCaseData(S0, R0, RegexOptions.None).Returns(0);
            yield return new TestCaseData(S1, R0, RegexOptions.None).Returns(1);

            yield return new TestCaseData(S0, R1, RegexOptions.None).Returns(0);
            yield return new TestCaseData(S1, R1, RegexOptions.None).Returns(1);
            
            yield return new TestCaseData(S0, R2, RegexOptions.None).Returns(0);
            yield return new TestCaseData(S1, R2, RegexOptions.None).Returns(1);
         }
      }
   }
}

All of them pass the test except R1, which uses \K, an operator the .NET regex engine does not support.

I will update my question as soon as I have more info on the Regex Flavor in use.

It is worth mentioning that none of these worked in the software, which may or may not be due to RegEx options I have no control over.
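
One possible explanation, stated here purely as an assumption because the software's settings are unknown: if the software applies RegexOptions.Multiline, ^ also matches at the start of every line, so a pattern like R2 can start matching at the last keyword line and still return a date. The sketch below illustrates that effect and tries \A (start of input regardless of options) as an alternative anchor. AnchorSketch and CountMatches are hypothetical names used only for illustration.

```csharp
using System.Text.RegularExpressions;

public static class AnchorSketch
{
    // Body of R2 without its leading anchor; (?s) is added when the pattern is built.
    private const string Tail =
        @"(?!(?:.*LT circa){2}).*LT circa\s*(\d{1,2}\.\d{1,2}\.\d{4})";

    // Counts matches for the given leading anchor ("^" or @"\A") under the given options.
    public static int CountMatches(string input, string anchor, RegexOptions options)
    {
        return new Regex("(?s)" + anchor + Tail, options).Matches(input).Count;
    }
}
```

For a text that contains the keyword twice, CountMatches(text, "^", RegexOptions.Multiline) finds a match starting at the second keyword line, while CountMatches(text, @"\A", RegexOptions.Multiline) finds none.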

CodePudding user response:

You may try this regex with negative look-aheads. It is slightly longer but will be more efficient than using DOTALL mode:

^(?:(?!.*LT circa).+\n)*(?:(?!LT circa).)*LT circa\s+(\d\d\.\d\d.\d{4})(?!(?:.+\n)*.*LT circa)

.NET RegEx Demo
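
A minimal C# sketch of how this pattern might be applied. It assumes the quantifiers after . and \s are +, and that no RegexOptions are set, so ^ anchors only at the start of the input. TemperedLookaheadSketch and ExtractDate are hypothetical names, not part of the answer.

```csharp
using System.Text.RegularExpressions;

public static class TemperedLookaheadSketch
{
    // Lines without the keyword are skipped first; capture group 1 holds the date;
    // the trailing negative lookahead rejects any later "LT circa".
    private const string Pattern =
        @"^(?:(?!.*LT circa).+\n)*(?:(?!LT circa).)*LT circa\s+(\d\d\.\d\d.\d{4})(?!(?:.+\n)*.*LT circa)";

    // Returns the date text when "LT circa" occurs exactly once, otherwise null.
    public static string ExtractDate(string input)
    {
        Match m = Regex.Match(input, Pattern);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Called on the second example text from the question, ExtractDate should return the captured date; on the first example text, where the keyword appears twice, it should return null.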

CodePudding user response:

You can use

(?s)^(?!(?:.*LT circa){2}).*LT circa\s*\K\d{1,2}\.\d{1,2}\.\d{4}

See the regex demo. The date regex can be enhanced, but the main point is the pattern around it.

Details:

  • (?s) - s flag making . match any character, including line breaks
  • ^ - start of string
  • (?!(?:.*LT circa){2}) - fail the match if there are at least two occurrences of LT circa anywhere in the string
  • .* - any zero or more chars, as many as possible
  • LT circa - the keyword
  • \s* - zero or more whitespaces
  • \K - match reset operator discarding all text matched so far (available in PCRE-style engines; .NET does not support it, so use a capturing group around the date there)
  • \d{1,2}\.\d{1,2}\.\d{4} - date-like pattern. (?:0?[1-9]|[12]\d|3[01])\.(?:0?[1-9]|1[0-2])\.\d{4}(?!\d) can be a somewhat more precise pattern for an arbitrary dd.MM.yyyy date (without leap year support).
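
Since .NET does not recognize \K, a sketch for that engine could read the date from a capturing group instead (the same idea as R2 in the question) and leave calendar validation, including leap years, to DateTime.TryParseExact. CaptureGroupSketch and TryExtractDate are hypothetical names, and the "d.M.yyyy" format string is an assumption based on the sample dates.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class CaptureGroupSketch
{
    // Same structure as the pattern above, with a capturing group in place of \K.
    private const string Pattern =
        @"(?s)^(?!(?:.*LT circa){2}).*LT circa\s*(\d{1,2}\.\d{1,2}\.\d{4})";

    // True only if the keyword occurs exactly once and is followed by a real calendar date.
    public static bool TryExtractDate(string input, out DateTime date)
    {
        date = default(DateTime);
        Match m = Regex.Match(input, Pattern);
        return m.Success && DateTime.TryParseExact(
            m.Groups[1].Value, "d.M.yyyy", CultureInfo.InvariantCulture,
            DateTimeStyles.None, out date);
    }
}
```

Parsing the captured text keeps the regex simple: impossible dates such as 31.02.2020 are rejected by the parser rather than by a longer pattern.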