How to extract a specific substring from a line using a regular expression

Time:10-28

I can match #include<stdio.h> using the following regular expression in C:

regex ("( )*#( )*include( )*<(stdio.h)( )*>( )*")

If I write the regular expression as regex("( )*#( )*include( )*<(.)*.h( )*>( )*") in C++, it matches any header file. But what if I want to extract a substring from each #include line?

Suppose I have these #include lines:
#include<string.h>
#include<math.h>
#include<stdlib.h>
#include<time.h>
From those lines, I just want to get the substrings:
string.h
math.h
stdlib.h
time.h

Put simply, I want to get the string inside the angle brackets < >.

My question is: how do I design a regular expression and write C code so that I can get the expected substring from any #include line?

Or:

How do I write C code that prints the string inside the angle brackets < >
using the regular expression regex("( )*#( )*include( )*<(.)*.h( )*>( )*")?

So far I have only designed the regular expression regex("( )*#( )*include( )*<(.)*.h( )*>( )*").
I have no idea how to print the string inside the angle brackets < >.

CodePudding user response:

"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."

For your problem, you can use std::regex_match with capture groups to get the header file name, like this:

#include <iostream>
#include <regex>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> v{"#include<string.h>", "#include<math.h>", "#include<stdlib.h>", "#include<time.h>"};
    // Group 1: the name with its delimiters; group 2: the name inside <...>;
    // group 3: the name inside "...".
    std::regex self_regex("#include\\s*(<([^\"<>|\\b]+)>|\"([^\"<>|\\b]+)\")",
                          std::regex_constants::ECMAScript | std::regex_constants::icase);

    for (const auto& s : v) {
        std::smatch base_match;
        if (std::regex_match(s, base_match, self_regex)) {
            // The first sub_match is the whole string; the next
            // sub_match is the first parenthesized expression.
            std::cout << "match\n";
            for (std::size_t i = 0; i < base_match.size(); ++i) {
                std::ssub_match sub_match = base_match[i];
                std::string piece = sub_match.str();
                std::cout << "  submatch " << i << ": " << piece << '\n';
            }
        }
    }
}

Result:

match
  submatch 0: #include<string.h>
  submatch 1: <string.h>
  submatch 2: string.h
  submatch 3: 
match
  submatch 0: #include<math.h>
  submatch 1: <math.h>
  submatch 2: math.h
  submatch 3: 
match
  submatch 0: #include<stdlib.h>
  submatch 1: <stdlib.h>
  submatch 2: stdlib.h
  submatch 3: 
match
  submatch 0: #include<time.h>
  submatch 1: <time.h>
  submatch 2: time.h
  submatch 3: 
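
In the output above the file name lands in submatch 2 for the <...> form, and submatch 3 is reserved for the "..." form. If only the name is needed, a small wrapper can hide that detail. The following is a minimal sketch, not part of the original answer: the function name header_name is hypothetical, and the pattern is a simplified variant that uses a non-capturing group so that only two capture groups remain.

```cpp
#include <regex>
#include <string>

// Hypothetical helper for illustration: if `line` is an #include directive,
// stores the file name in `out` and returns true; otherwise returns false.
bool header_name(const std::string& line, std::string& out) {
    // Group 1 captures the <...> form, group 2 the "..." form.
    // (?: ... ) groups the alternation without creating a capture group.
    static const std::regex re(
        R"re(\s*#\s*include\s*(?:<([^<>"]+)>|"([^<>"]+)")\s*)re");
    std::smatch m;
    if (!std::regex_match(line, m, re)) {
        return false;
    }
    // Exactly one of the two groups took part in the match.
    out = m[1].matched ? m[1].str() : m[2].str();
    return true;
}
```

Calling header_name with the line #include<string.h> and a std::string variable should leave string.h in that variable; a line that is not an #include directive makes it return false and leaves the variable untouched.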

CodePudding user response:

Alas, Regular Expressions are not easy. There is a common aphorism among programmers that goes:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

You need to spend more time studying how to form regular expressions. The most notable site people point to is Regular-Expressions.info, but I personally find it a bit dense. Googling around for online regex toyboxes helps a lot too.

In your particular case, you have several problems to overcome. The simplest way would be to

  1. Read the source file one line at a time
  2. Check each line for a match
  3. Extract the matching substring

For an #include directive, you should know which characters are valid between those two angle brackets, as well as the general valid form of a preprocessor #include directive itself, which you almost have.

Here is a little toy that finds lines that match without bothering to check for valid filenames.

#include <iomanip>
#include <iostream>
#include <regex>
#include <string>

int main()
{
  std::regex re{ R"<>(^\s*#\s*include\s*<(.+)>\s*$)<>" };
  std::smatch m;
  std::string s;
  while (getline( std::cin, s ))
  {
    if (regex_match( s, m, re ))
    {
      // --> You _could_ check if the filename is valid here first, if needed. <-- //
      std::cout << std::quoted( m[1].str() ) << "\n";
    }
  }
}

Notice that the regular expression:

^\s*#\s*include\s*<(.+)>\s*$

includes redundant BOL and EOL markers (^ and $ respectively), since std::regex_match() matches the entire string. You could remove them. Just be careful if you ever update your code to use, say, a regular expression iterator or the like.
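
To see why the anchors start to matter once you move away from std::regex_match(), here is a small sketch that runs the same pattern through std::regex_search() with and without them. The three function names are made up for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helpers for illustration only.

// regex_match: the pattern must cover the whole line, so no anchors are needed.
bool matches_whole_line(const std::string& line) {
    static const std::regex re(R"(\s*#\s*include\s*<(.+)>\s*)");
    return std::regex_match(line, re);
}

// regex_search without anchors: succeeds if the pattern occurs anywhere.
bool found_unanchored(const std::string& line) {
    static const std::regex re(R"(\s*#\s*include\s*<(.+)>\s*)");
    return std::regex_search(line, re);
}

// regex_search with ^ and $: behaves like regex_match on a single line.
bool found_anchored(const std::string& line) {
    static const std::regex re(R"(^\s*#\s*include\s*<(.+)>\s*$)");
    return std::regex_search(line, re);
}
```

For a line such as // #include <stdio.h> is commented out, the unanchored search reports a hit because the directive appears somewhere inside the line, while the whole-line match and the anchored search both reject it.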

Notice also that the sub-expression match is greedy, gobbling everything it can until the final > at the end of the line. This needs more attention should you want to use a regex iterator.
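
One common way to tame the greediness is to replace the dot with a negated character class, so the capture cannot run past the first closing bracket. The sketch below assumes that file names never contain < or >; the names find_headers and greedy_capture are hypothetical and exist only to contrast the two patterns.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every name that appears as #include <name>
// anywhere in `text`, using a regex iterator.
std::vector<std::string> find_headers(const std::string& text) {
    // [^<>]+ cannot cross a '>' so each capture stops at the first one.
    static const std::regex re(R"(#\s*include\s*<([^<>]+)>)");
    std::vector<std::string> names;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        names.push_back((*it)[1].str());
    }
    return names;
}

// For comparison: the greedy (.+) runs up to the LAST '>' on the line.
std::string greedy_capture(const std::string& line) {
    static const std::regex re(R"(#\s*include\s*<(.+)>)");
    std::smatch m;
    return std::regex_search(line, m, re) ? m[1].str() : std::string();
}
```

On a line like #include <a.h> // see <b.h>, the greedy version captures everything from a.h up to b.h, including the comment in between, whereas the negated class captures only a.h.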

Ideally, however, I think you should avoid the regex altogether and just parse the string using some simpler string processing.
