Regex (JS Notation): Select spaces not in [ [], {}, "" ] to tokenize string

Time:01-09

I need to tokenize a string on every space that is not between quotes. I am using regex in JavaScript notation.

For example:

" Test Test " ab c " Test" "Test " "Test" "T e s t"

becomes

[" Test Test ",ab,c," Test","Test ","Test","T e s t"]

For my use case however, the solution should work in the following test setting: https://www.regextester.com/

All spaces not within quotes should be highlighted in the above setting. If they are highlighted there, they will be parsed correctly in my program.

For more specificity, I am using Boost::Regex in C++ to do the parsing as follows:

...
std::string test_string("\" Test Test \" ab c \" Test\" \"Test \" \"Test\" \"T e s t\"");
// (,|;)?\\s      : Split on ,\s or ;\s
// (?![^\\[]*\\]) : Ignore spaces inside []
// (?![^\\{]*\\}) : Ignore spaces inside {}
// (?![^\"].*\")  : Ignore spaces inside "" !!! MY ATTEMPT DOESN'T WORK !!!

//Note the below regex delimiter declaration does not include the erroneous regex.
boost::regex delimiter("(,|;\\s|\\s)+(?![^\\[]*\\])(?![^\\(]*\\))(?![^\\{]*\\})");
std::vector<std::string> string_vector;
boost::split_regex(string_vector, test_string, delimiter);

For those of you who do not use Boost::regex or C++, the above link should enable testing of viable regexes for this use case.

Thank you all for your assistance. I hope you can help me with the above problem.

CodePudding user response:

I would 100% not use regular expressions for this. First off, because it's way easier to express as a PEG grammar instead. E.g.:

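#include <boost/spirit/home/x3.hpp>
#include <iomanip>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>

std::vector<std::string> tokens(std::string_view input) {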
    namespace x3 = boost::spirit::x3;
    std::vector<std::string> r;

    auto atom                            //
        = '[' >> *~x3::char_(']') >> ']' //
        | '{' >> *~x3::char_('}') >> '}' //
        | '"' >> *~x3::char_('"') >> '"' //
        | x3::graph;

    auto token = x3::raw[*atom];

    parse(input.begin(), input.end(), token % +x3::space, r);
    return r;
}

This, off the bat, already performs as you intend:


int main() {
    for (std::string const input : {R"(" Test Test " ab c " Test" "Test " "Test" "T e s t")"}) {
        std::cout << input << "\n";
        for (auto& tok : tokens(input))
            std::cout << " - " << quoted(tok, '\'') << "\n";
    }
}

Output:

" Test Test " ab c " Test" "Test " "Test" "T e s t"
 - '" Test Test "'
 - 'ab'
 - 'c'
 - '" Test"'
 - '"Test "'
 - '"Test"'
 - '"T e s t"'

BONUS

Where this really makes a difference is when you realize that you want to handle nested constructs (e.g. "string" [ {1,2,"3,4", [true,"more [string]"], 9 }, "bye ]).

Regular expressions are notoriously bad at this. Spirit grammar rules can be recursive though. If you make your grammar description more explicit I could show you examples.
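To give an idea of what handling nesting involves, here is a minimal sketch in JavaScript rather than Spirit. It is a hand-written scanner, not a recursive grammar: it keeps the expected closing brackets on a stack and splits only on whitespace at depth zero and outside quotes. The name tokenizeNested and the decision to treat a backslash as an escape inside quotes are assumptions made for illustration, and unbalanced input is not reported as an error.

```javascript
// Hypothetical helper: split on whitespace that is outside "", [] and {}.
function tokenizeNested(input) {
  const closers = new Map([['[', ']'], ['{', '}']]);
  const stack = [];      // closing brackets we still expect, innermost last
  const tokens = [];
  let current = '';
  let inQuotes = false;

  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    if (inQuotes) {
      current += ch;
      if (ch === '\\' && i + 1 < input.length) {
        current += input[++i];            // assumption: backslash escapes the next char
      } else if (ch === '"') {
        inQuotes = false;                 // closing quote
      }
    } else if (ch === '"') {
      inQuotes = true;
      current += ch;
    } else if (closers.has(ch)) {
      stack.push(closers.get(ch));        // entering a nested construct
      current += ch;
    } else if (stack.length > 0 && ch === stack[stack.length - 1]) {
      stack.pop();                        // leaving the innermost construct
      current += ch;
    } else if (/\s/.test(ch) && stack.length === 0) {
      if (current !== '') tokens.push(current);   // delimiter at top level
      current = '';
    } else {
      current += ch;
    }
  }
  if (current !== '') tokens.push(current);
  return tokens;
}

console.log(tokenizeNested('"string" [ {1,2,"3,4", [true,"more [string]"], 9 }, "bye" ] x'));
```

The bracket inside "more [string]" is ignored because the scanner is in quote mode at that point, which is the kind of context a flat lookahead pattern cannot track.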

CodePudding user response:

You can use multiple regexes if you are ok with that. The idea is to replace spaces inside quotes with a non-printable char (\x01), and restore them after the split:

const input = `" Test Test " ab c " Test" "Test " "Test" "T e s t"`;
let result = input
  .replace(/"[^"]*"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
  .split(/ +/) // split on runs of spaces
  .map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);

If you have escaped quotes within a string, such as "a \"quoted\" token" you can use this regex instead:

const input = String.raw`"A \"quoted\" token" " Test Test " ab c " Test" "Test " "Test" "T e s t"`; // String.raw keeps the backslashes
let result = input
  .replace(/".*?[^\\]"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
  .split(/ +/) // split on runs of spaces
  .map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);
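Note that ".*?[^\\]" needs at least one character between the quotes, so it mishandles an empty string "", and it treats a closing quote as escaped whenever the string ends in a backslash, even an escaped one. A commonly used pattern that avoids both is "(?:\\.|[^"\\])*", which consumes either an escaped pair or a plain character at each step. Below is a sketch of the same placeholder approach with that pattern; the function name splitKeepingQuoted is made up for this example, and it assumes \x01 never occurs in the input.

```javascript
// Hypothetical helper; assumes the placeholder \x01 is absent from the input.
function splitKeepingQuoted(input) {
  // Opening quote, then any number of (backslash + any char) or (non-quote, non-backslash), then closing quote.
  const quoted = /"(?:\\.|[^"\\])*"/g;
  return input
    .replace(quoted, m => m.replace(/ /g, '\x01')) // hide spaces inside quotes
    .split(/ +/)                                   // split on runs of spaces
    .filter(s => s !== '')                         // drop empties from leading/trailing spaces
    .map(s => s.replace(/\x01/g, ' '));            // restore the hidden spaces
}

const sample = String.raw`"A \"quoted\" token" "" "ends with \\" ab c`;
console.log(splitKeepingQuoted(sample));
```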

If you want to parse nested brackets, you are generally better off with a proper language parser. It can be done with regexes, however; see: Parsing JavaScript objects with functions as JSON
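For the flat, quotes-only case from the question, a single pattern is also possible if the tester setting requires one. The usual trick is a lookahead that accepts a space only when an even number of quotes follows it up to the end of the string. This is a sketch under a few assumptions: quotes are balanced, there are no escaped quotes, and the name splitOutsideQuotes is invented here. It has been reasoned through for JavaScript only, not tried against Boost.Regex, and the lookahead rescans the rest of the string at every space, so it can be slow on long inputs.

```javascript
// Hypothetical helper: split on whitespace followed by an even number of quotes.
function splitOutsideQuotes(input) {
  // (?:[^"]*"[^"]*")* consumes quotes in pairs; [^"]*$ then requires no quote to be left over.
  const outsideQuotes = /\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/;
  return input.split(outsideQuotes);
}

console.log(splitOutsideQuotes('" Test Test " ab c " Test" "Test " "Test" "T e s t"'));
```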

Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
