I wanted to construct a view over all the sub-matches of regex in text. Here are two ways to define such a view:
char const text[] = "The IP addresses are: 192.168.0.25 and 127.0.0.1";
std::regex regex{R"((\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}))"};

auto sub_matches_view =
    std::ranges::subrange(
        std::cregex_iterator{std::ranges::begin(text), std::ranges::end(text), regex},
        std::cregex_iterator{}
    ) |
    std::views::join;

auto sub_matches_sv_view =
    std::ranges::subrange(
        std::cregex_iterator{std::ranges::begin(text), std::ranges::end(text), regex},
        std::cregex_iterator{}
    ) |
    std::views::join |
    std::views::transform([](std::csub_match const& sub_match) -> std::string_view {
        return {sub_match.first, sub_match.second};
    });
sub_matches_view's value type is std::csub_match. It is created by first constructing a view of std::cmatch objects (via the regex iterator), and since each std::cmatch is a range of std::csub_match objects, it is flattened with std::views::join.
sub_matches_sv_view's value type is std::string_view. It is identical to sub_matches_view, except that it also wraps each element of sub_matches_view in a std::string_view.
Here's a usage example of the above ranges:
for(auto const& sub_match : sub_matches_view) {
    std::cout << std::string_view{sub_match.first, sub_match.second} << std::endl; // #1
}
for(auto const& sv : sub_matches_sv_view) {
    std::cout << sv << std::endl; // #2
}
Loop #1 works without problems: the printed results are correct. However, loop #2 causes heap-use-after-free issues according to the Address Sanitizer. In fact, just looping over sub_matches_sv_view without accessing the elements at all causes this problem too. Here is the code on Compiler Explorer as well as the output of the Address Sanitizer.
I am out of ideas as to where my mistake is. text and regex never go out of scope, and I don't see any iterators that might be accessed outside of their lifetimes. The std::csub_match object holds iterators (.first, .second) into text, so I don't think it needs to remain alive itself after constructing the std::string_view in std::views::transform.
I know there are many other ways to iterate over regex matches, but I am specifically interested in what's causing the memory bugs in my program. I don't need workarounds for this issue.
CodePudding user response:
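The problem is std::regex_iterator and the fact that it stashes: the iterator stores the current match inside itself, and dereferencing it returns a reference to that stored object.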
That type basically looks like this:
class regex_iterator {
    vector<match> matches;
public:
    auto operator*() const -> vector<match> const& { return matches; }
};
What this means, for instance, is that even though this iterator's reference type is T const&, if you have two copies of the same iterator, they'll actually give you references into different objects.
Now, join_view<R>::iterator basically looks like this:
class iterator {
    // the iterator into the range we're joining
    iterator_t<R> outer;
    // an iterator into *outer that we're iterating over
    iterator_t<range_reference_t<R>> inner;
};
Which, for regex_iterator, roughly looks like this:
class iterator {
    // the regex matches
    vector<match> outer;
    // the current match
    match* inner;
};
Now, what happens when you copy this iterator? The copy's inner still refers to the original's outer! These aren't actually independent in the way that you'd expect. Which means that if the original goes out of scope, we have a dangling iterator!
This is what you're seeing here: transform_view ends up copying the iterator (as it is certainly allowed to do), and now you have a dangling iterator (libc++'s implementation moves instead, which is why it happens to work in this case, as 康桓瑋 pointed out). But we can reproduce the same issue without transform as long as we copy the iterator and destroy the original. For instance:
#include <ranges>
#include <regex>
#include <iostream>
#include <optional>
#include <string_view>

int main() {
    std::string_view text = "The IP addresses are: 192.168.0.25 and 127.0.0.1";
    std::regex regex{R"((\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}))"};

    auto a = std::ranges::subrange(
        std::cregex_iterator(std::ranges::begin(text), std::ranges::end(text), regex),
        std::cregex_iterator{}
    );
    auto b = a | std::views::join;

    std::optional i = b.begin();
    std::cout << std::string_view((*i)->first, (*i)->second) << '\n'; // fine

    auto j = *i; // copy the join iterator
    i.reset();   // destroy the original
    std::cout << std::string_view(j->first, j->second) << '\n'; // boom
}
I'm not sure what a solution to this problem would look like, but the cause is the std::regex_iterator and not the views::join or the views::transform.