How to get the list of replacement strings from a Regex

Time:08-02

I need to extract the replacement strings from a Regex.Replace() call. As an example:

  • Input string: October 1, 2020 f october 23, 1995 sdf october 12, 1999.
  • Pattern: (october)(\s+\d{1,2}),?(\s+(?:19|20)\d{1,2})(?=\s+(?!(?:to|through|thru)\b)\p{L}+)
  • Replacement: $1$2,$3,

What I need is the list of replacement strings, the way RegexBuddy lists them at the bottom of its window.

The list in this case would be:

  • October 1, 2020,
  • october 23, 1995,

Note that these are the replacements performed, not the matches: each one has a comma added at the end.

I'm using C# in my code but any language would do.

Code:

void Main()
{
    var input = "October 1, 2020 f october 23, 1995 sdf october 12, 1999.";
    var pattern = @"(october)(\s+\d{1,2}),?(\s+(?:19|20)\d{1,2})(?=\s+(?!(?:to|through|thru)\b)\p{L}+)";
    var replaced = Regex.Replace(input, pattern, "$1$2,$3,", RegexOptions.IgnoreCase);
}

Answer:

If you need to capture each replacement as it is made, you can use this overload:

Regex.Replace(String, String, MatchEvaluator, RegexOptions)
In a specified input string, replaces all strings that match a specified regular expression with a string returned by a MatchEvaluator delegate. Specified options modify the matching operation.

evaluator MatchEvaluator
A custom method that examines each match and returns either the original matched string or a replacement string.

evaluator is a delegate that lets us process each match ourselves instead of having the replacement applied automatically. It is intended for implementing additional or conditional logic on the match, but we can take advantage of it to capture the individual replacements:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

namespace RegexTester
{
    class Program
    {
        static void Main(string[] args)
        {
            var input = "October 1, 2020 f october 23, 1995 sdf october 12, 1999.";
            var pattern = @"(october)(\s+\d{1,2}),?(\s+(?:19|20)\d{1,2})(?=\s+(?!(?:to|through|thru)\b)\p{L}+)";
            Console.WriteLine();
            Console.WriteLine("Original Text: {0}", input);
            Console.WriteLine("Regex Evaluation Steps:");
            List<string> replacements = new List<string>();
            var replaced = Regex.Replace(input, pattern, (Match match) =>
            {
                // expand the replacement pattern for this match manually
                var result = match.Result("$1$2,$3,");
                // capture the individual results
                replacements.Add(result);

                // Write it out for demo purposes
                Console.WriteLine($"{match.Value} => {result}");

                // The delegate MUST return the result
                return result;
            }, RegexOptions.IgnoreCase);

            Console.WriteLine();
            Console.WriteLine("Replaced Text: {0}", replaced);

            Console.WriteLine("Replacements Found:");
            foreach(var r in replacements)
            {
                Console.WriteLine(r);
            }
        }
    }
}

Output:

Original Text: October 1, 2020 f october 23, 1995 sdf october 12, 1999.
Regex Evaluation Steps:
October 1, 2020 => October 1, 2020,
october 23, 1995 => october 23, 1995,

Replaced Text: October 1, 2020, f october 23, 1995, sdf october 12, 1999.
Replacements Found:
October 1, 2020,
october 23, 1995,

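This solution is generic: match.Result() expands the replacement pattern for each match, just as Regex.Replace() does with a string replacement, so it works for any pattern and input text.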
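
If you need this in more than one place, the evaluator technique can be wrapped in a small helper. The following is only a sketch: ReplacementCollector and ReplaceAndCollect are hypothetical names invented for this example and are not part of .NET; only Regex.Replace, MatchEvaluator and Match.Result are real APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ReplacementCollector
{
    // Hypothetical helper (not a .NET API): performs the replacement and also
    // returns the replacement string produced for each individual match.
    public static (string Replaced, List<string> Replacements) ReplaceAndCollect(
        string input, string pattern, string replacement,
        RegexOptions options = RegexOptions.None)
    {
        var replacements = new List<string>();
        var replaced = Regex.Replace(input, pattern, match =>
        {
            // Expand $1, $2, ... against this match, record it, and hand it back
            var result = match.Result(replacement);
            replacements.Add(result);
            return result;
        }, options);
        return (replaced, replacements);
    }
}
```

Calling ReplacementCollector.ReplaceAndCollect(input, pattern, "$1$2,$3,", RegexOptions.IgnoreCase) with the input and pattern from the question should give back the replaced text together with the two strings "October 1, 2020," and "october 23, 1995,".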


For the punters at home, this was my original post, which shows how to capture the matches separately and highlights the flaws in that approach.

If you want to capture the output from a Regex, then you should use Regex.Matches(), or Regex.Match() if you only need the first result.

Each match gives you access to the capture groups for the individual sub-expressions in your pattern, but if you only need the full text of each match, use the .Value property, which Match inherits from System.Text.RegularExpressions.Capture.
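
As a rough illustration of reading matches and their groups, the sketch below collects the text of every match followed by its numbered capture groups. MatchInspector and MatchesWithGroups are hypothetical names chosen for this example; Regex.Matches, Match.Groups and .Value are the actual .NET members being demonstrated.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchInspector
{
    // Hypothetical helper: one row per match; index 0 holds the whole match,
    // indexes 1..n hold the numbered capture groups.
    public static List<string[]> MatchesWithGroups(
        string input, string pattern, RegexOptions options)
    {
        var rows = new List<string[]>();
        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            var row = new string[m.Groups.Count];
            for (int i = 0; i < m.Groups.Count; i++)
            {
                row[i] = m.Groups[i].Value; // Groups[0].Value equals m.Value
            }
            rows.Add(row);
        }
        return rows;
    }
}
```

For the pattern in the question, each row would hold four entries: the matched date, then the month, the day (with its leading whitespace) and the year (with its leading whitespace).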

The OP points out that this returns the original matched strings, whereas we want the replaced values...

To get the replaced values, we could re-run the replacement over the original captured strings, but the regex in its current form won't work directly on these inputs: its lookahead only succeeds, and so the comma is only injected, when there is text following the match.

One hack to get around this is to inject a suffix onto the captured text before re-running the replacement. It will work for this specific example, but is not a generic solution as it requires assumed knowledge about the expression.

using System;
using System.Linq;
using System.Text.RegularExpressions;

namespace RegexTester
{
    class Program
    {
        static void Main(string[] args)
        {
            var input = "October 1, 2020 f october 23, 1995 sdf october 12, 1999.";
            var pattern = @"(october)(\s+\d{1,2}),?(\s+(?:19|20)\d{1,2})(?=\s+(?!(?:to|through|thru)\b)\p{L}+)";
            var replaced = Regex.Replace(input, pattern, "$1$2,$3,", RegexOptions.IgnoreCase);
            Console.WriteLine("Original Text: {0}", input);
            Console.WriteLine("Replaced Text: {0}", replaced);
            var matches = Regex.Matches(input, pattern, RegexOptions.IgnoreCase);
            var matchedStrings = matches.Cast<Match>().Select(x => x.Value).ToList();
            // NOTE: we need to amend the input because this specific pattern only injects a comma if there is a following text entry
            string suffix = " RANDOMTEXT.";
            // if you just need a list of them.
            var replacedStrings = matchedStrings.Select(x => Regex.Replace(x + suffix, pattern, "$1$2,$3,", RegexOptions.IgnoreCase))
                                                .Select(x => x.Substring(0, x.Length - suffix.Length))
                                                .ToList();

            Console.WriteLine("Replaced Strings:");
            foreach(var match in matchedStrings)
            {
                // writing out the source and the replacement as a proof:
                var replacement = Regex.Replace(match + suffix, pattern, "$1$2,$3,", RegexOptions.IgnoreCase);
                replacement = replacement.Substring(0, replacement.Length - suffix.Length);
                Console.WriteLine("{0} => {1}", match, replacement);
            }

        }
    }
}

Output:

Original Text: October 1, 2020 f october 23, 1995 sdf october 12, 1999.
Replaced Text: October 1, 2020, f october 23, 1995, sdf october 12, 1999.
Replaced Strings:
October 1, 2020 => October 1, 2020,
october 23, 1995 => october 23, 1995,
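
A possible way to avoid the suffix hack, offered as a sketch rather than as part of the original post: Match.Result() expands a replacement pattern against a match that was already found in the full input, so the lookahead has been satisfied and nothing has to be re-run. MatchResultDemo and ExpandMatches are hypothetical names used only for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class MatchResultDemo
{
    // Hypothetical helper: returns the expanded replacement string for every
    // match without calling Regex.Replace and without padding the input.
    public static List<string> ExpandMatches(
        string input, string pattern, string replacement, RegexOptions options)
    {
        return Regex.Matches(input, pattern, options)
                    .Cast<Match>()                      // MatchCollection is non-generic on older frameworks
                    .Select(m => m.Result(replacement)) // expand $1, $2, ... for this match
                    .ToList();
    }
}
```

Because the matches come from the original input, this does not depend on knowing what the pattern's lookahead expects.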