How do I regex match each individual word within backticks?

Time:09-19

I am trying to get results for each individual word within backticks. For example, if I have something like this text

some description `match these_words th_is_wor` or `THIS_WOR thi_sqw` a `word_snake`

I want the search results to be:

  • match
  • these_words
  • th_is_wor
  • THIS_WOR
  • thi_sqw
  • word_snake

I'm essentially trying to get each "word", where a word is one or more English letters or underscore characters, between each pair of backticks.

I currently have the following regex that seems to match ALL the text between each set of backticks:

/(?<=`)(\b([^`\\]|\w|_)*\b)(?=`)/gi

This uses a positive lookbehind to find text that comes after a ` character: (?<=`)

Followed by a capture group for zero or more characters, each of which is either anything other than a ` or a \, a word character, or an _ character, all within word boundaries: (\b([^`\\]|\w|_)*\b)

Followed by a positive lookahead for another ` character to ensure we're enclosed within backticks.

This sort of works, but captures ALL the text between backticks instead of each individual word. This would require further processing which I'd like to avoid. My regex results right now are:

  • match these_words th_is_wor
  • THIS_WOR thi_sqw
  • word_snake

If there is a generic formula for getting each individual word within backticks or within quotes, that would be fantastic. Thank you!

Note: Much appreciated if the answer could be formatted in C#, but not required, as I can do that bit myself if needed.

Edit: Thank you Mr. إين from Ben Awad's Discord server for the quickest response! This is the solution as proposed by him. Also thank you to everyone who responded to my post, you guys are all AWESOME!

using System;
using System.Text.RegularExpressions;
class Program {
  static void Main(string[] args) {
    string backtickSentence = "i want to `match these_words th_is_wor` or `THIS_WOR thi_sqw` a `word_snake`";
    string backtickPattern = @"(?<=^[^`]*(?:`[^`]*`[^`]*)*`(?:[^`]* )*)\w+";
    string quoteSentence = "some other \"words in a \" sentence be \"gettin me tripped_up AllUp inHere\"";
    string quotePattern = "(?<=^[^\"]*(?:\"[^\"]*\"[^\"]*)*\"(?:[^\"]* )*)\\w+";
    // Call Matches method without specifying any options.
    try {
      foreach (Match match in Regex.Matches(backtickSentence, backtickPattern, RegexOptions.None, TimeSpan.FromSeconds(1)))
        Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);

      Console.WriteLine();
      foreach (Match match in Regex.Matches(quoteSentence, quotePattern, RegexOptions.None, TimeSpan.FromSeconds(1)))
        Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
    }
    catch (RegexMatchTimeoutException) {} // Do Nothing: Assume that timeout represents no match.

    Console.WriteLine();
    // Call Matches method for case-insensitive matching.
    try {
      foreach (Match match in Regex.Matches(backtickSentence, backtickPattern, RegexOptions.IgnoreCase))
        Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);

      Console.WriteLine();
      foreach (Match match in Regex.Matches(quoteSentence, quotePattern, RegexOptions.IgnoreCase))
        Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
    }
    catch (RegexMatchTimeoutException) {}
  }
}

His explanation was as follows; you can also paste his regex into regexr.com for more detail:

var NOT_BACKTICK = @"[^`]*";
var WORD = @"(\w+)";

var START = $@"^{NOT_BACKTICK}"; // match anything before the first backtick
var INSIDE_BACKTICKS = $@"`{NOT_BACKTICK}`"; // match a pair of backticks
var ODD_NUM_BACKTICKS_BEFORE = $@"{START}({INSIDE_BACKTICKS}{NOT_BACKTICK})*`"; // match anything before the first backtick, then any amount of paired backticks with anything afterwards, then a single opening backtick

var CONDITION = $@"(?<={ODD_NUM_BACKTICKS_BEFORE})";
var CONDITION_TRUE = $@"(?: *{WORD})"; // match any spaces then a word
var CONDITION_FALSE = $@"(?:(?<={ODD_NUM_BACKTICKS_BEFORE}{NOT_BACKTICK} ){WORD})"; // match up to an opening backtick, then anything up to a space before the current word


// uses conditional matching
// see https://learn.microsoft.com/en-us/dotnet/standard/base-types/alternation-constructs-in-regular-expressions#Conditional_Expr
var pattern = $@"(?{CONDITION}{CONDITION_TRUE}|{CONDITION_FALSE})";

// refined backtick pattern
string backtickPattern = @"(?<=^[^`]*(?:`[^`]*`[^`]*)*`(?:[^`]* )*)\w+";
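
Since the question asks for a generic formula that covers both backticks and quotes, the refined pattern can be parameterized on the delimiter. The following is only a minimal sketch and not part of the original answer: the class DelimitedWords and its methods BuildPattern and Find are hypothetical names, and it assumes a single-character delimiter such as ` or " (a delimiter of ] is not handled).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class DelimitedWords
{
    // Hypothetical helper: rebuilds the refined lookbehind pattern for one delimiter.
    // Assumption: the delimiter is a single character such as ` or ".
    public static string BuildPattern(char delimiter)
    {
        string d = Regex.Escape(delimiter.ToString());
        string notD = $"[^{d}]*"; // any run of non-delimiter characters
        // Odd number of delimiters before the word, then anything up to a space.
        return $"(?<=^{notD}(?:{d}{notD}{d}{notD})*{d}(?:{notD} )*)\\w+";
    }

    // Returns every word that sits inside a delimited span, in order.
    public static List<string> Find(string input, char delimiter) =>
        Regex.Matches(input, BuildPattern(delimiter))
             .Cast<Match>()
             .Select(m => m.Value)
             .ToList();

    static void Main()
    {
        string backticks = "i want to `match these_words th_is_wor` or `THIS_WOR thi_sqw` a `word_snake`";
        string quotes = "some other \"words in a \" sentence be \"gettin me tripped_up AllUp inHere\"";

        // prints: match, these_words, th_is_wor, THIS_WOR, thi_sqw, word_snake
        Console.WriteLine(string.Join(", ", Find(backticks, '`')));
        // prints: words, in, a, gettin, me, tripped_up, AllUp, inHere
        Console.WriteLine(string.Join(", ", Find(quotes, '"')));
    }
}
```

Like the original pattern, this only recognizes words that directly follow the opening delimiter or a space, so a word preceded by punctuation inside the span would be skipped.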

CodePudding user response:

With C# you can use the Group.Captures Property and then get the capture group values.

Note that \w also matches _

`(?:[\p{Zs}\t]*(\w+)[\p{Zs}\t]*)+`

Explanation

  • ` Match literally
  • (?: Non-capturing group to repeat as a whole
    • [\p{Zs}\t]* Match optional spaces
    • (\w+) Capture group 1, match 1 or more word characters
    • [\p{Zs}\t]* Match optional spaces
  • )+ Close the non-capturing group and repeat it 1 or more times
  • ` Match literally

See a .NET regex demo and a C# demo.

For example:

// Assumes: using System; using System.Linq; using System.Text.RegularExpressions;
string s = @"some description ` match these_words th_is_wor ` or `THIS_WOR thi_sqw` a `word_snake`";
string pattern = @"`(?:[\p{Zs}\t]*(\w+)[\p{Zs}\t]*)+`";
foreach (Match m in Regex.Matches(s, pattern))
{
    string[] result = m.Groups[1].Captures.Select(c => c.Value).ToArray();
    Console.WriteLine(String.Join(',', result));
}

Output

match,these_words,th_is_wor
THIS_WOR,thi_sqw
word_snake
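
The output above is grouped per backtick span, whereas the question asks for one flat list of words. One way to get that, sketched here under the assumption that LINQ is available, is to flatten the captures of group 1 across all matches with SelectMany. The class name CaptureWords and the method Find are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class CaptureWords
{
    // Same pattern as in the answer: a backtick, one or more words, a backtick.
    const string Pattern = @"`(?:[\p{Zs}\t]*(\w+)[\p{Zs}\t]*)+`";

    // Flattens every capture of group 1 from every match into one list.
    public static List<string> Find(string input) =>
        Regex.Matches(input, Pattern)
             .Cast<Match>()
             .SelectMany(m => m.Groups[1].Captures.Cast<Capture>())
             .Select(c => c.Value)
             .ToList();

    static void Main()
    {
        string s = "some description `match these_words th_is_wor` or `THIS_WOR thi_sqw` a `word_snake`";
        // prints one word per line, six lines in total
        foreach (string word in Find(s))
            Console.WriteLine(word);
    }
}
```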

CodePudding user response:

For chaining matches to a backtick followed by a word boundary the \G anchor can be used:

(?:\G(?!^)[^\w`]+|`\b)(\w+)
  • `\b sets the starting point for the chain
  • \G(?!^)[^\w`]+ continues where the previous match ended (the negative lookahead prevents \G from matching at the start) and consumes characters that are not word characters or backticks
  • (\w+) captures each word in the first group (.NET demo)

See this demo at regex101 or a .NET variant without a capturing group at regexstorm.
This pattern does not check for the second backtick (that would need another lookahead).
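
As a rough illustration of that extra lookahead, the sketch below adds (?=[^`]*`) after the opening backtick so that a chain only starts when another backtick follows later in the text. This is one possible reading of the remark, not the answerer's own code, and ChainedWords and Find are hypothetical names. It only rejects an unterminated final backtick. It does not verify that backticks are paired, so a closing backtick immediately followed by a word character could still start a chain.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class ChainedWords
{
    // \G chain as in the answer, plus a lookahead that requires a later backtick.
    const string Pattern = @"(?:\G(?!^)[^\w`]+|`\b(?=[^`]*`))(\w+)";

    // Returns the word captured by group 1 of every match, in order.
    public static List<string> Find(string input) =>
        Regex.Matches(input, Pattern)
             .Cast<Match>()
             .Select(m => m.Groups[1].Value)
             .ToList();

    static void Main()
    {
        // The last backtick is never closed, so its words are skipped.
        string s = "`match these_words` and `unclosed words";
        // prints: match, these_words
        Console.WriteLine(string.Join(", ", Find(s)));
    }
}
```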
