Parsing nested capturing group in regex

Time:03-31

I have an input string like this: 1(43;46) 2(5;41)(7;91) 2(5;41)(7;91)

I want to parse the number before the parentheses and the two numbers inside each pair of parentheses. Multiple pairs of parentheses are allowed, each holding its own x/y position.

TYPE(X;Y)

  • TYPE is my type variable (an enum, as integer)
  • X is my x position
  • Y is my y position
  • Every set ends with a space.

I have this regex statement for it:

(\d+)((\((\d+);(\d+)\))+)

It works, but is not so clean. I think there is a better way to do the repeating part of my input string. Can anyone help me with parsing this in regex?

CodePudding user response:

I would use two levels of regular expressions:

  • an outer level to match an expression, i.e., a type followed by a list of positions, and
  • an inner level to match each position in a list.

The patterns for each expression can be enclosed in a raw string, R"()".

The outer pattern would capture:

  • the type, with (\d+), and
  • the list of positions. Each position should match \(\d+;\d+\). Since we want a list of positions, we indicate that with (?:<position>)+, where the ?: says we don't want to capture each position separately (yet). Then, we capture the whole list with (<list of positions>).

The inner pattern would capture two numbers, separated by a semicolon.

[Demo]

#include <iostream>  // cout
#include <regex>  // regex_search, smatch
#include <string>

int main() {
    std::string input{"1(43;46) 2(5;41)(7;91) 3(6;42)(8;92)"};
    std::smatch matches_expr{};
    std::regex pattern_expr{R"((\d+)((?:\(\d+;\d+\))+))"};
    while (std::regex_search(input, matches_expr, pattern_expr)) {
        std::string type{matches_expr[1]};
        std::cout << type;

        std::smatch matches_xy{};
        std::string xy_list{matches_expr[2]};
        std::regex pattern_xy{R"((\d+);(\d+))"};
        while (std::regex_search(xy_list, matches_xy, pattern_xy)) {
            std::cout << "(" << matches_xy[1] << ";" << matches_xy[2] << ")";
            xy_list = matches_xy.suffix();
        }
        std::cout << " ";

        input = matches_expr.suffix();
    }
}

// Outputs:
//
//   1(43;46) 2(5;41)(7;91) 3(6;42)(8;92)
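
The same two-level idea can also be written with std::sregex_iterator, which avoids reassigning the input string to the match suffix on every pass. This is only a sketch: ParsedGroup and parse_groups are hypothetical names that are not part of the answer above, and it assumes every number fits in an int (std::stoi throws otherwise).

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for one parsed expression: a type and its positions.
struct ParsedGroup {
    int type{};
    std::vector<std::pair<int, int>> positions{};
};

// Two-level parse: the outer pattern finds "TYPE(X;Y)(X;Y)...",
// the inner pattern walks the captured list of positions.
std::vector<ParsedGroup> parse_groups(const std::string& input) {
    static const std::regex pattern_expr{R"((\d+)((?:\(\d+;\d+\))+))"};
    static const std::regex pattern_xy{R"(\((\d+);(\d+)\))"};

    std::vector<ParsedGroup> result{};
    const std::sregex_iterator end{};
    for (std::sregex_iterator it{input.begin(), input.end(), pattern_expr}; it != end; ++it) {
        ParsedGroup group{};
        group.type = std::stoi((*it)[1].str());

        // Iterate over the sub-match range directly; the list is not copied.
        const auto& list = (*it)[2];
        for (std::sregex_iterator xy{list.first, list.second, pattern_xy}; xy != end; ++xy) {
            group.positions.emplace_back(std::stoi((*xy)[1].str()),
                                         std::stoi((*xy)[2].str()));
        }
        result.push_back(std::move(group));
    }
    return result;
}
```

Calling parse_groups on the demo input should give three groups, the second of which holds the positions (5;41) and (7;91). Returning a structure instead of printing keeps the parsing separate from whatever is done with the result.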

CodePudding user response:

You are asking for a regex. But I want to recommend a different approach.

C++ is an object-oriented language. So you can have data, and methods operating on that data. Only the methods should work on the data, not the outside world. That is called encapsulation. It will make your software more robust and easier to change.

Next: To solve a problem like yours, you should always split the big problem into smaller and smaller problems. The very small problems are then usually easy to solve.

You need to understand that in C++, formatted input functions (like >>) can often be used to solve such tasks easily.

Let's use all the above principles and tackle the problem from that point of view. We drill down and look at the smallest entity that needs to be read. That is a position, given in the format "(xx;yy)". And this could simply be done with formatted IO.

You could define 3 char variables "c0", "c1" and "c2" and, additionally integers "x" and "y" for the coordinates.

With that, a position may be simply read with

is >> c0 >> x >> c1 >> y >> c2;

"c0" will contain a '(', "c1" will have a ';' and "c2" will contain the closing bracket ')'. More than simple. No need to keep the chars . . .

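If malformed input should be rejected instead of silently accepted, the three characters can be checked after reading. The following is a minimal sketch under that assumption; read_position is a hypothetical helper, not part of the solution further down, and it reports errors by setting the stream's failbit, the same way built-in formatted input does.

```cpp
#include <istream>
#include <sstream>

// Hypothetical helper: reads one "(x;y)" from the stream and checks the delimiters.
// x and y are only written when the whole position was read and is well formed.
std::istream& read_position(std::istream& is, int& x, int& y) {
    char c0{}, c1{}, c2{};
    int tx{}, ty{};
    if (is >> c0 >> tx >> c1 >> ty >> c2) {
        if (c0 == '(' && c1 == ';' && c2 == ')') {
            x = tx;
            y = ty;
        } else {
            // Wrong delimiter: flag the stream as failed so callers can stop reading.
            is.setstate(std::ios_base::failbit);
        }
    }
    return is;
}
```

With a std::istringstream holding "(12;34)" the call succeeds and stores 12 and 34; with "[12;34]" the stream ends up in a failed state and x and y keep their previous values.
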
Then we can go one level up and read a group of positions, including its type. We can do this by first reading a complete group into a string. So, a thing like "2(5;41)(7;91)" can be read, because formatted input functions will read until the next white space.

The string will be put into a std::istringstream and then we can extract everything from there. First the type, and then, in a loop, all positions. The last part in particular will be simple, because we already have a function that will extract one position.

The last step is to simply extract all those groups of positions from the stream until eof.

Do not underestimate the advantage of having all these classes. You can simply add additional methods to the classes and enhance the functionality without disturbing any other function. That is the OO approach . . .

Please see one possible solution, using the approach described above:

#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <string>


using TYPE = int;

// Smallest data entity. A position, consisting of x and y coordinate
struct Position {

    // Data part: X and Y coordinate
    int x{};
    int y{};

    // Simple IO. Read value with standard formatted io
    friend std::istream& operator >> (std::istream& is, Position& pos) {

        // We are not interested in brackets and semicolons. Throw away.
        char tmp;

        // Simple reading of for example (12;34)
        return is >> tmp >> pos.x >> tmp >> pos.y >> tmp;
    }

    // Simple inserter function
    friend std::ostream& operator << (std::ostream& os, const Position& pos) {
        return os << '(' << pos.x << ';' << pos.y << ')';
    }
};

// next bigger data entity: A type, followed by many positions. so, a group of positions
struct Group {

    // The data part
    TYPE type{};                        // Type
    std::vector<Position> position{};   // Many positions

    // Simple extraction from stream
    friend std::istream& operator >> (std::istream& is, Group& group) {

        // Clear old data and define some temporary variables
        group.position.clear();
        std::string groupAsString{};
        Position pos{};

        // Read a group as a string. This will read a type and positions, up to the next space, into a string
        // For example: the string "2(5;41)(7;91)" will be read into the std::string "groupAsString"
        is >> groupAsString;

        // Next, we want to extract the data from this group. So, put it into a stringstream
        std::istringstream issGroup{ groupAsString };

        // Get the type
        issGroup >> group.type;

        // And read all positions, using the above extraction function
        while (issGroup >> pos)
            group.position.push_back(pos);
        return is;
    }
    // Simple inserter
    friend std::ostream& operator << (std::ostream& os, const Group& group) {
        os << group.type;
        for (const Position& pos : group.position)
            os << pos;
        return os;
    }
};

// So, this is the biggest entity. All groups of positions with their type
struct Data {
    // Data part
    std::vector<Group> groups{};

    // Simple input, extract data from stream using above extraction functions
    friend std::istream& operator >> (std::istream& is, Data& data) {

        // Delete old data
        data.groups.clear();

        // Read groups of positions until end of stream
        Group group{};
        while (is >> group)
            data.groups.push_back(group);
        return is;
    }
    // Simple inserter function
    friend std::ostream& operator << (std::ostream& os, const Data& data) {
        for (const Group& group : data.groups)
            os << group << '\n';
        return os << '\n';
    }
};

int main() {
    // Test data. Can be an open file or any stream, including std::cin
    std::istringstream test{ R"(1(43;46) 2(5;41)(7;91) 2(5;41)(7;91)
3(43;46)(31;31)(32;32) 4(5;41)(7;91)(5;41)(7;91)
5(5;41)(7;91)(43;46)(31;31)(32;32) 6(5;41)(7;91)(5;41)(7;91)(5;41)(7;91)(5;41)(7;91)
)" };

    // Here we will store our data
    Data data{};

    // Simple extraction of all positions groups
    test >> data;

    // Simple output
    std::cout << data;
}