I used regex_token_iterator<> to get all matched substrings in a line, as suggested in this question. But the code sometimes misses the second matched substring in a line, and the lines where this happens change from run to run. Is this a bug in regex_token_iterator<>, or is there something wrong in my code? The compiler is Apple clang version 14.0.0 (clang-1400.0.29.202), and I compiled the following code with -std=c++14.
I also tried another suggestion from the question above, which is to apply regex_search() repeatedly in a while loop, and that version worked properly. I just want to know why the version with regex_token_iterator<> does not work, and whether my usage is wrong.
Code:
#include<regex>
#include<iostream>
#include<string>
#include<fstream>
#include<sstream>
#include<typeinfo> // bad_cast
using namespace std;
struct bad_from_string : bad_cast{
    const char* what() const noexcept override{
        return "bad cast from string";
    }
};
template<typename T>
T from_string(const string& s){
    istringstream is{s};
    T t;
    if(!(is>>t))
        throw bad_from_string{};
    return t;
}
int main(){
    regex pat{R"((\d{1,2})/(\d{1,2})/(\d{4}))"}; // e.g. 7/21/2022
    ifstream ifs{"test_regex_token_iterator.txt"};
    ofstream ofs{"test_out_regex_token_iterator.txt"};
    regex_token_iterator<string::iterator> rend; // default constructor is used for indicating the end of the sequence
    for(string line; getline(ifs, line);){
        smatch matches;
        string replace_pattern;
        int month{0}, day{0}, year{0};
        regex_token_iterator<string::iterator> riter(line.begin(), line.end(), pat);
        // for each matched substring, replace it individually
        while(riter!=rend){
            string matched_substring{(*riter).str()};
            // *riter returns a reference to the sub_match object riter is pointing to.
            // sub_match is not a string. sub_match::str() returns the string of the sub_match.
            // put each matched substring into variable "matches"
            regex_search(matched_substring, matches, pat);
            // get the day, month, and year values in int
            day = from_string<int>(matches.str(2));
            month = from_string<int>(matches.str(1));
            year = from_string<int>(matches.str(3));
            // here make replace_pattern yyyy-mm-dd
            if(month<10 && day<10)
                replace_pattern = to_string(year) + "-0" + to_string(month) + "-0" + to_string(day); // both day and month need the front '0'
            else if(month<10)
                replace_pattern = to_string(year) + "-0" + to_string(month) + "-" + to_string(day);
            else if(day<10)
                replace_pattern = to_string(year) + "-" + to_string(month) + "-0" + to_string(day);
            else
                replace_pattern = to_string(year) + "-" + to_string(month) + "-" + to_string(day);
            line = regex_replace(line, regex(matched_substring), replace_pattern); // regex_replace() returns a string
            // since I want to replace only 1 matched substring *riter, I use the exact substring
            // in the place of regex pattern
            ++riter; // move to the next matched substring
        }
        ofs << line << endl;
    }
    return 0;
}
test_regex_token_iterator.txt:
12/01/2022 - 12/31/2022
12/01/2022 - 12/31/2022
12/01/2022 - 12/31/2022
12/01/2022 - 12/31/2022
10/01/2022 - 10/31/2022
10/01/2022 - 10/31/2022
10/01/2022 - 10/31/2022
10/01/2022 - 10/31/2022
10/01/2022 - 10/31/2022
Sample test_out_regex_token_iterator.txt (the result changes between runs):
2022-12-01 - 12/31/2022
2022-12-01 - 2022-12-31
2022-12-01 - 12/31/2022
2022-12-01 - 12/31/2022
2022-10-01 - 10/31/2022
2022-10-01 - 2022-10-31
2022-10-01 - 10/31/2022
2022-10-01 - 10/31/2022
2022-10-01 - 10/31/2022
I expected all the matched substrings, including the dates in the second column, to be replaced, but only some of them were replaced properly. The expected result:
2022-12-01 - 2022-12-31
2022-12-01 - 2022-12-31
2022-12-01 - 2022-12-31
2022-12-01 - 2022-12-31
2022-10-01 - 2022-10-31
2022-10-01 - 2022-10-31
2022-10-01 - 2022-10-31
2022-10-01 - 2022-10-31
2022-10-01 - 2022-10-31
CodePudding user response:
Enabling AddressSanitizer shows that your code causes undefined behaviour: https://godbolt.org/z/n3rnn9nqY
riter contains iterators into line, but at the end of your while loop you reassign line, which invalidates line's iterators and therefore riter. When you then try to increment riter, you enter the realm of undefined behaviour.
Adding a separate string for your output fixes the problem: https://godbolt.org/z/Grqe1vv5x
for(string line; getline(ifs, line);){
    smatch matches;
    string outputLine = line; // all edits go here, so line and its iterators stay valid
    string replace_pattern;
    int month{0}, day{0}, year{0};
    regex_token_iterator<string::iterator> riter(line.begin(), line.end(), pat);
    // for each matched substring, replace it individually
    while(riter!=rend){
        string matched_substring{(*riter).str()};
        // *riter returns a reference to the sub_match object riter is pointing to.
        // sub_match is not a string. sub_match::str() returns the string of the sub_match.
        // put each matched substring into variable "matches"
        regex_search(matched_substring, matches, pat);
        // get the day, month, and year values in int
        day = from_string<int>(matches.str(2));
        month = from_string<int>(matches.str(1));
        year = from_string<int>(matches.str(3));
        // here make replace_pattern yyyy-mm-dd
        if(month<10 && day<10)
            replace_pattern = to_string(year) + "-0" + to_string(month) + "-0" + to_string(day); // both day and month need the front '0'
        else if(month<10)
            replace_pattern = to_string(year) + "-0" + to_string(month) + "-" + to_string(day);
        else if(day<10)
            replace_pattern = to_string(year) + "-" + to_string(month) + "-0" + to_string(day);
        else
            replace_pattern = to_string(year) + "-" + to_string(month) + "-" + to_string(day);
        outputLine = regex_replace(outputLine, regex(matched_substring), replace_pattern); // regex_replace() returns a string
        // since I want to replace only 1 matched substring *riter, I use the exact substring
        // in the place of regex pattern
        ++riter; // move to the next matched substring
    }
    ofs << outputLine << endl;
}