How to define a boost tokenizer that returns boost::iterator_range

I am trying to parse a file where each line consists of attributes separated by ;. Each attribute is written as key value or key=value, where key and value can be enclosed in double quotes " so that they may contain special characters such as whitespace, the equals sign = or the semicolon ;.

To do so, I first use boost::algorithm::make_split_iterator and then, to allow for double quotes, boost::tokenizer.

I need to parse every key and value as a boost::iterator_range<const char*>. I tried the code below, but it does not build. The tokenizer definition may be correct and the error may instead come from printing the iterator_range. I can provide more information if necessary.

#include <boost/algorithm/string.hpp>
#include <boost/range/iterator_range.hpp>
#include <boost/tokenizer.hpp>

boost::iterator_range<const char*> line;

const auto topDelim = boost::token_finder(
  [](const char c) { return (c == ';'); },
  boost::token_compress_on);
for (auto attrIt = make_split_iterator(line, topDelim); !attrIt.eof() && !attrIt->empty(); ++attrIt) {
  std::string escape("\\");
  std::string delim(" =");
  std::string quote("\"");
  boost::escaped_list_separator<char> els(escape, delim, quote);
  boost::tokenizer<
    boost::escaped_list_separator<char>,
    boost::iterator_range<const char*>::iterator, // how to define iterator for iterator_range?
    boost::iterator_range<const char*>
  > tok(*attrIt, els);

  for (auto t : tok) {
    std::cout << t << std::endl;
  }
}

Build errors:

/third_party/boost/boost-1_58_0/include/boost/token_functions.hpp: In instantiation of 'bool boost::escaped_list_separator<Char, Traits>::operator()(InputIterator&, InputIterator, Token&) [with InputIterator = const char*; Token = boost::iterator_range<const char*>; Char = char; Traits = std::char_traits<char>]':
/third_party/boost/boost-1_58_0/include/boost/token_iterator.hpp:70:36:   required from 'void boost::token_iterator<TokenizerFunc, Iterator, Type>::initialize() [with TokenizerFunc = boost::escaped_list_separator<char>; Iterator = const char*; Type = boost::iterator_range<const char*>]'
/third_party/boost/boost-1_58_0/include/boost/token_iterator.hpp:77:63:   required from 'boost::token_iterator<TokenizerFunc, Iterator, Type>::token_iterator(TokenizerFunc, Iterator, Iterator) [with TokenizerFunc = boost::escaped_list_separator<char>; Iterator = const char*; Type = boost::iterator_range<const char*>]'
/third_party/boost/boost-1_58_0/include/boost/tokenizer.hpp:86:33:   required from 'boost::tokenizer<TokenizerFunc, Iterator, Type>::iter boost::tokenizer<TokenizerFunc, Iterator, Type>::begin() const [with TokenizerFunc = boost::escaped_list_separator<char>; Iterator = const char*; Type = boost::iterator_range<const char*>; boost::tokenizer<TokenizerFunc, Iterator, Type>::iter = boost::token_iterator<boost::escaped_list_separator<char>, const char*, boost::iterator_range<const char*> >]'
test.cpp:21:23:   required from here
/third_party/boost/boost-1_58_0/include/boost/token_functions.hpp:188:19: error: no match for 'operator+=' (operand types are 'boost::iterator_range<const char*>' and 'const char')
  188 |           else tok+=*next;
      |                ~~~^~~~~~~

Answer:

As I said, you want parsing, not splitting. Specifically, if you were to split the input into iterator ranges, you would have to repeat the effort of parsing e.g. quoted constructs to get the intended (unquoted) value.
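
To see why splitting alone falls short, here is a minimal sketch using only the standard library. naive_split is a hypothetical helper, not part of Boost; it cuts a line at every ';' without knowing anything about quotes:

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: split `line` at every `delim`, ignoring quoting.
// The pieces are views into `line`; nothing is copied.
std::vector<std::string_view> naive_split(std::string_view line, char delim) {
    std::vector<std::string_view> parts;
    std::size_t start = 0;
    while (true) {
        std::size_t const pos = line.find(delim, start);
        if (pos == std::string_view::npos) {
            parts.push_back(line.substr(start)); // last piece (may be empty)
            return parts;
        }
        parts.push_back(line.substr(start, pos - start));
        start = pos + 1; // continue after the delimiter
    }
}
```

Applied to "h=10;i=11;" bogus;yup "nope" it yields four pieces ("h=10, then i=11, then " bogus, then yup "nope") instead of the two attributes the format describes, and every piece would still need its quotes and escapes resolved afterwards.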

I'd follow your specification with Boost Spirit, parsing into these types:

using Attribute = std::pair<std::string /*key*/, //
                            std::string /*value*/>;
using Line      = std::vector<Attribute>;
using File      = std::vector<Line>;

A Grammar

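Now using X3 we can write expressions to define the syntax. The rules are shown top-down here; with auto variables like these, each one has to be defined before it is used in the actual source: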

auto file      = x3::skip(x3::blank)[ line % x3::eol ];

Within a file, blank space (std::isblank) is generally skipped.

The content consists of one or more lines separated by newlines.

auto line      = attribute % ';';

A line consists of one or more attributes separated by ';'

auto attribute = field >> -x3::lit('=') >> field;
auto field     = quoted | unquoted;

An attribute is two fields, optionally separated by =. Note that each field is either a quoted or unquoted value.

Now things get a little more tricky: when defining the field rules we want them to be "lexemes", i.e. whitespace inside them is not skipped.

auto unquoted = x3::lexeme[+(x3::graph - ';' - '=')];

Note how graph already excludes whitespace (see std::isgraph). In addition, we prohibit a naked ';' or '=' so that we don't run into the next attribute/field.

For fields that may contain whitespace and/or those special characters, we define the quoted lexeme:

auto quoted      = x3::lexeme['"' >> *quoted_char >> '"'];

So, that's just a pair of double quotes with any number of quoted characters in between, where

auto quoted_char = '\\' >> x3::char_ | ~x3::char_('"');

a quoted character is either any character escaped with \ or any character other than the closing quote.
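
To make the two field rules concrete, here is a hand-written sketch of what they accept, using only the standard library rather than Spirit. The function names are mine (hypothetical), it works on a std::string_view with an explicit position, and it does no blank skipping, which the surrounding grammar would normally take care of:

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Mirrors '"' >> *quoted_char >> '"': returns the unescaped content.
std::optional<std::string> read_quoted(std::string_view in, std::size_t& pos) {
    std::size_t i = pos;
    if (i >= in.size() || in[i] != '"') return std::nullopt;
    std::string value;
    for (++i; i < in.size() && in[i] != '"'; ++i) {
        if (in[i] == '\\' && i + 1 < in.size()) ++i; // escape: take next char verbatim
        value += in[i];
    }
    if (i == in.size()) return std::nullopt; // no closing quote
    pos = i + 1;                             // step past the closing quote
    return value;
}

// Mirrors +(graph - ';' - '='): one or more visible characters.
std::optional<std::string> read_unquoted(std::string_view in, std::size_t& pos) {
    std::size_t i = pos;
    while (i < in.size() && std::isgraph(static_cast<unsigned char>(in[i])) &&
           in[i] != ';' && in[i] != '=')
        ++i;
    if (i == pos) return std::nullopt; // an empty match is not a field
    std::string value(in.substr(pos, i - pos));
    pos = i;
    return value;
}

// Mirrors quoted | unquoted: the first alternative that matches wins.
// On failure `pos` is left unchanged.
std::optional<std::string> read_field(std::string_view in, std::size_t& pos) {
    if (auto q = read_quoted(in, pos)) return q;
    return read_unquoted(in, pos);
}
```

For the input "\"Hello\\ World\\!\"" '8', the first call returns "Hello\ World\!" (with the double quotes as part of the value, since they were escaped) and, after stepping over the blank, the second call returns '8'.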

TEST TIME

Let's exercise it, Live On Compiler Explorer:

for (std::string const& str :
     {
         R"(a 1)",
         R"(b    = 2      )",
         R"("c"="3")",
         R"(a=1;two 222;three "3 3 3")",
         R"(b=2;three 333;four "4 4 4"
            c=3;four 444;five "5 5 5")",
         // special cases
         R"("e=" "5")",
         R"("f=""7")",
         R"("g="="8")",
         R"("\"Hello\\ World\\!\"" '8')",
         R"("h=10;i=11;" bogus;yup "nope")",
         // not ok?
         R"(h i j)",
         // allowing empty lines/attributes?
         "",
         "a 1;",
         ";",
         ";;",
         R"(a=1;two 222;three "3 3 3"

            n=1;gjb 222;guerr "3 3 3"
        )",
     }) //
{
    File contents;
    if (parse(begin(str), end(str), parser::file, contents))
        fmt::print("Parsed:\n\t- {}\n", fmt::join(contents, "\n\t- "));
    else
        fmt::print("Not Parsed\n");
}

Prints

Parsed:
    - {("a", "1")}
Parsed:
    - {("b", "2")}
Parsed:
    - {("c", "3")}
Parsed:
    - {("a", "1"), ("two", "222"), ("three", "3 3 3")}
Parsed:
    - {("b", "2"), ("three", "333"), ("four", "4 4 4")}
    - {("c", "3"), ("four", "444"), ("five", "5 5 5")}
Parsed:
    - {("e=", "5")}
Parsed:
    - {("f=", "7")}
Parsed:
    - {("g=", "8")}
Parsed:
    - {(""Hello\ World\!"", "'8'")}
Parsed:
    - {("h=10;i=11;", "bogus"), ("yup", "nope")}
Not Parsed
Not Parsed
Not Parsed
Not Parsed
Not Parsed
Not Parsed

Allowing empty elements

This is as simple as replacing line with:

auto line = -(attribute % ';');

To also allow redundant separators:

auto line = -(attribute % +x3::lit(';')) >> *x3::lit(';');

See that Live On Compiler Explorer

Insisting on Iterator Ranges

I explained above why I think this is a bad idea. Consider how you would correctly interpret the key/value from this line:

"\"Hello\\ World\\!\"" '8'

You simply don't want to deal with the grammar outside the parser. However, maybe your data is a 10 gigabyte memory-mapped file and copying every field is not an option. In that case, change the field type:

using Field     = boost::iterator_range<std::string::const_iterator>;
using Attribute = std::pair<Field /*key*/, //
                            Field /*value*/>;

And then add x3::raw[] to the lexemes:

auto quoted      = x3::lexeme[x3::raw['"' >> *quoted_char >> '"']];

auto unquoted    = x3::lexeme[x3::raw[+(x3::graph - ';' - '=')]];

See it Live On Compiler Explorer
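
With raw ranges the parser no longer decodes anything, so whoever consumes a Field has to do it. As a rough sketch of that extra step, using only the standard library: field_value is a hypothetical helper, std::string_view stands in for the characters covered by one Field, and the code assumes that a range starting and ending with a double quote was matched by the quoted rule:

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: turn the raw source text of a field into its value.
// Assumption: a range that starts and ends with '"' came from the quoted rule.
std::string field_value(std::string_view raw) {
    bool const quoted =
        raw.size() >= 2 && raw.front() == '"' && raw.back() == '"';
    if (!quoted) return std::string(raw); // unquoted text already is the value

    raw = raw.substr(1, raw.size() - 2);  // drop the surrounding quotes
    std::string value;
    value.reserve(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] == '\\' && i + 1 < raw.size()) ++i; // skip the escape marker
        value += raw[i];
    }
    return value;
}
```

This is exactly the duplicated effort mentioned at the top: the escape rules now live both in the grammar and in the code that reads the ranges, so the two have to be kept in sync by hand.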