regex_token_iterator<> sometimes misses matched substrings

Time:11-16

I used regex_token_iterator<> to get all matched substrings in a line, as suggested in this question. But the code sometimes misses the 2nd matched substring in a line, and the lines where this happens change from run to run. Is this a bug in regex_token_iterator<>, or is there something wrong in my code? The compiler I used is Apple clang version 14.0.0 (clang-1400.0.29.202), and I compiled the following code with -std=c++14.

I also tried another suggestion from the question above, which is to apply regex_search() repeatedly in a while loop, and that version worked properly. I just want to know why the version with regex_token_iterator<> is not working, and whether my usage is wrong.

code:

#include<regex>
#include<iostream>
#include<string>
#include<fstream>
#include<sstream>
#include<typeinfo> // bad_cast

using namespace std;

struct bad_from_string : bad_cast{
  const char* what() const noexcept override{
    return "bad cast from string";
  }
};

template<typename T>
T from_string(const string& s){
  istringstream is{s};
  T t;
  if(!(is>>t))
    throw bad_from_string{};
  return t;
}

int main(){
  regex pat{R"((\d{1,2})/(\d{1,2})/(\d{4}))"}; // e.g. 7/21/2022
  ifstream ifs{"test_regex_token_iterator.txt"};
  ofstream ofs{"test_out_regex_token_iterator.txt"};

  regex_token_iterator<string::iterator> rend; // default constructor is used for indicating the end of the sequence
  
  for(string line; getline(ifs, line);){
    smatch matches;
    
    string replace_pattern; 

    int month{0}, day{0}, year{0};

    regex_token_iterator<string::iterator> riter(line.begin(), line.end(), pat);
      
    // for each matched substring, replace it individually
    while(riter!=rend){
      string matched_substring{(*riter).str()};
      // *riter returns a reference to the sub_match object riter is pointing to.
      // sub_match is not a string. sub_match::str() returns the string of the sub_match.
      
      // put each matched substring into variable "matches"
      regex_search(matched_substring, matches, pat);
      
      // get the day, month, and year values in int
      day = from_string<int>(matches.str(2));
      month = from_string<int>(matches.str(1));
      year = from_string<int>(matches.str(3));
      
      // here make replace_pattern yyyy-mm-dd
      if(month<10 && day<10)
        replace_pattern = to_string(year) + "-0" + to_string(month) + "-0" + to_string(day); // both day and month need the leading '0'
      else if(month<10)
        replace_pattern = to_string(year) + "-0" + to_string(month) + "-" + to_string(day);
      else if(day<10)
        replace_pattern = to_string(year) + "-" + to_string(month) + "-0" + to_string(day);
      else
        replace_pattern = to_string(year) + "-" + to_string(month) + "-" + to_string(day);
      
      line = regex_replace(line, regex(matched_substring), replace_pattern); // regex_replace() returns a string
      // since I want to replace only 1 matched substring *riter, I use the exact substring 
      // in the place of regex pattern
      
      ++riter;      // move to the next matched substring
    }
  
    ofs << line << endl; 
  }
  
  return 0;
}

test_regex_token_iterator.txt:

12/01/2022 - 12/31/2022
12/01/2022 - 12/31/2022
12/01/2022 - 12/31/2022
12/01/2022 - 12/31/2022

10/01/2022 - 10/31/2022
10/01/2022 - 10/31/2022
10/01/2022 - 10/31/2022
10/01/2022 - 10/31/2022
10/01/2022 - 10/31/2022

Sample test_out_regex_token_iterator.txt (the result changes between runs):

2022-12-01 - 12/31/2022
2022-12-01 - 2022-12-31
2022-12-01 - 12/31/2022
2022-12-01 - 12/31/2022

2022-10-01 - 10/31/2022
2022-10-01 - 2022-10-31
2022-10-01 - 10/31/2022
2022-10-01 - 10/31/2022
2022-10-01 - 10/31/2022

I expected all the matched substrings, including the dates in the 2nd column, to be replaced, but only some of them were replaced properly. The expected result:

2022-12-01 - 2022-12-31
2022-12-01 - 2022-12-31
2022-12-01 - 2022-12-31
2022-12-01 - 2022-12-31

2022-10-01 - 2022-10-31
2022-10-01 - 2022-10-31
2022-10-01 - 2022-10-31
2022-10-01 - 2022-10-31
2022-10-01 - 2022-10-31

Answer:

Enabling AddressSanitizer shows that your code has undefined behaviour: https://godbolt.org/z/n3rnn9nqY

riter holds iterators into line, but at the end of each pass through your while loop you assign a new value to line. That invalidates line's iterators and therefore riter, so when you then increment riter you are in the realm of undefined behaviour.

Adding a separate string for your output fixes the problem: https://godbolt.org/z/Grqe1vv5x

for(string line; getline(ifs, line);){
  smatch matches;
  string outputLine = line;
  
  string replace_pattern; 

  int month{0}, day{0}, year{0};

  regex_token_iterator<string::iterator> riter(line.begin(), line.end(), pat);
    
  // for each matched substring, replace it individually
  while(riter!=rend){
    string matched_substring{(*riter).str()};
    // *riter returns a reference to the sub_match object riter is pointing to.
    // sub_match is not a string. sub_match::str() returns the string of the sub_match.
    
    // put each matched substring into variable "matches"
    regex_search(matched_substring, matches, pat);
    
    // get the day, month, and year values in int
    day = from_string<int>(matches.str(2));
    month = from_string<int>(matches.str(1));
    year = from_string<int>(matches.str(3));
    
    // here make replace_pattern yyyy-mm-dd
    if(month<10 && day<10)
      replace_pattern = to_string(year) + "-0" + to_string(month) + "-0" + to_string(day); // both day and month need the leading '0'
    else if(month<10)
      replace_pattern = to_string(year) + "-0" + to_string(month) + "-" + to_string(day);
    else if(day<10)
      replace_pattern = to_string(year) + "-" + to_string(month) + "-0" + to_string(day);
    else
      replace_pattern = to_string(year) + "-" + to_string(month) + "-" + to_string(day);
    
    outputLine = regex_replace(outputLine, regex(matched_substring), replace_pattern); // regex_replace() returns a string
    // since I want to replace only 1 matched substring *riter, I use the exact substring 
    // in the place of regex pattern
    
    ++riter;      // move to the next matched substring
  }

  ofs << outputLine << endl; 
}
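
The same idea (never modify the string that the regex iterator is walking) can also be written without a per-match regex_replace(). The following is only a minimal sketch, not the code from the question or from the godbolt links: the helper name reformat_dates and the pad2 lambda are made up for illustration, and it assumes the same m/d/yyyy pattern as above. It iterates over a const line with sregex_iterator and appends the text between matches and the reformatted dates to a separate string.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of the original post): returns a copy of
// `line` in which every m/d/yyyy date is rewritten as yyyy-mm-dd.
// `line` is never modified, so the iterators held by the regex iterator
// stay valid for the whole loop.
std::string reformat_dates(const std::string& line) {
    static const std::regex pat{R"((\d{1,2})/(\d{1,2})/(\d{4}))"};

    // Left-pad a one-digit field with '0' (e.g. "7" becomes "07").
    auto pad2 = [](const std::string& s) { return s.size() < 2 ? "0" + s : s; };

    std::string out;
    auto last = line.cbegin();  // end of the previous match (or start of line)

    for (std::sregex_iterator it{line.cbegin(), line.cend(), pat}, end; it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);  // copy the text before this match unchanged
        out += m.str(3) + "-" + pad2(m.str(1)) + "-" + pad2(m.str(2));
        last = m[0].second;            // continue after this match
    }
    out.append(last, line.cend());     // copy whatever follows the last match
    return out;
}
```

Inside the read loop this would be used as ofs << reformat_dates(line) << endl;. Because the output is built in a separate string, a date that occurs twice in one line is also handled once per occurrence, rather than being replaced everywhere by the first regex_replace() call.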