Combining regex and ranges causes memory issues

Time:05-12

I wanted to construct a view over all the sub-matches of regex in text. Here are two ways to define such a view:

    char const text[] = "The IP addresses are: 192.168.0.25 and 127.0.0.1";
    std::regex regex{R"((\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}))"};

    auto sub_matches_view = 
        std::ranges::subrange(
            std::cregex_iterator{std::ranges::begin(text), std::ranges::end(text), regex},
            std::cregex_iterator{}
        ) |
        std::views::join;

    auto sub_matches_sv_view = 
        std::ranges::subrange(
            std::cregex_iterator{std::ranges::begin(text), std::ranges::end(text), regex},
            std::cregex_iterator{}
        ) |
        std::views::join |
        std::views::transform([](std::csub_match const& sub_match) -> std::string_view { return {sub_match.first, sub_match.second}; });
  • sub_matches_view's value type is std::csub_match. It is created by first constructing a view of std::cmatch objects (via the regex iterator), and since each std::cmatch is a range of std::csub_match objects, it is flattened with std::views::join.
  • sub_matches_sv_view's value type is std::string_view. It is identical to sub_matches_view, except it also wraps each element of sub_matches_view in a std::string_view.

Here's a usage example of the above ranges:

for(auto const& sub_match : sub_matches_view) {
    std::cout << std::string_view{sub_match.first, sub_match.second} << std::endl; // #1
}

for(auto const& sv : sub_matches_sv_view) {
    std::cout << sv << std::endl; // #2
}

Loop #1 works without problems - the printed results are correct. However, loop #2 causes heap-use-after-free issues according to the Address Sanitizer. In fact, just looping over sub_matches_sv_view without accessing the elements at all causes this problem too. Here is the code on Compiler Explorer as well as the output of the Address Sanitizer.

I am out of ideas as to where my mistake is. text and regex never go out of scope, and I don't see any iterators that might be accessed outside of their lifetimes. The std::csub_match object holds iterators (.first, .second) into text, so I don't think it needs to remain alive itself after constructing the std::string_view in std::views::transform.
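
The assumption about std::csub_match can be checked in isolation. The following is a minimal sketch, not part of the original program: first_octet is a hypothetical helper that returns a std::string_view built from a sub-match after the std::cmatch that produced it has been destroyed. It uses the pointer-plus-length std::string_view constructor so that it should also compile as C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string_view>

// Hypothetical helper: returns the first capture group of the first match,
// or an empty view when nothing matches. Assumes `re` has at least one group.
std::string_view first_octet(char const* begin, char const* end, std::regex const& re) {
    assert(begin != nullptr && end != nullptr);
    std::cmatch match;  // destroyed when this function returns
    if (!std::regex_search(begin, end, match, re)) {
        return {};
    }
    std::csub_match const& sub = match[1];
    // sub.first and sub.second point into the caller's buffer, not into `match`.
    return std::string_view(sub.first, static_cast<std::size_t>(sub.second - sub.first));
}
```

The returned view only depends on the lifetime of the character buffer, which suggests that the std::string_view conversion in the lambda is not the source of the problem.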

I know there are many other ways to iterate over regex matches, but I am specifically interested in what's causing the memory bugs in my program. I don't need work-arounds for this issue.

CodePudding user response:

The problem is std::regex_iterator and the fact that it is a stashing iterator: the match it dereferences to is stored inside the iterator object itself.

That type basically looks like this:

class regex_iterator {
    vector<match> matches;

public:
    auto operator*() const -> vector<match> const& { return matches; }
};

What this means, for instance, is that even though this iterator's reference type is T const&, if you have two copies of the same iterator, they'll actually give you references into different objects.
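
This can be observed directly. The sketch below is only an illustration (the function name copies_stash_separate_matches is made up for this example): it copies a std::sregex_iterator and compares the addresses of the objects the two copies dereference to. It relies on the iterator storing its match as a member, which is how the standard describes it and how the mainstream implementations behave.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical check: true when two equal iterators refer to distinct match objects.
bool copies_stash_separate_matches() {
    std::string const text = "192.168.0.25";
    std::regex const re{R"(\d+)"};

    std::sregex_iterator first{text.begin(), text.end(), re};
    assert(first != std::sregex_iterator{});  // the pattern matches at least once

    std::sregex_iterator second = first;  // copy positioned at the same match

    // The iterators compare equal, yet each one dereferences to its own stored match.
    return first == second && &*first != &*second;
}
```

For an ordinary container iterator the two addresses would be the same, because both copies would refer to the element inside the container.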

Now, join_view<R>::iterator basically looks like this:

class iterator {
    // the iterator into the range we're joining
    iterator_t<R> outer;

    // an iterator into *outer that we're iterating over
    iterator_t<range_reference_t<R>> inner;
};

Which, for regex_iterator, roughly looks like this:

class iterator {
    // the regex matches
    vector<match> outer;

    // the current match
    match* inner;
};

Now, what happens when you copy this iterator? The copy's inner still refers to the original's outer! These aren't actually independent in the way that you'd expect. Which means that if the original goes out of scope, we have a dangling iterator!
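
The same effect can be shown without any regex or ranges machinery. The following is a toy model under my own naming (toy_join_iterator is hypothetical and much simpler than the real join_view iterator): it owns some data and keeps a pointer into it, and the implicitly generated copy constructor copies both members as they are.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for join_view's iterator over a stashing outer iterator.
struct toy_join_iterator {
    std::vector<int> outer;  // owned data, like the match stashed in regex_iterator
    int const* inner;        // points at an element of some `outer`

    explicit toy_join_iterator(std::vector<int> values)
        : outer(std::move(values)), inner(outer.data()) {
        assert(!outer.empty());
    }
    // The implicit copy constructor duplicates `outer` into new storage
    // but copies `inner` as a plain address.
};

// True when the copy's `inner` points into the original's storage, not its own.
bool copy_points_into_original() {
    toy_join_iterator original{std::vector<int>{1, 2, 3}};
    toy_join_iterator copy = original;
    return copy.inner == original.outer.data() && copy.inner != copy.outer.data();
}
```

Once original is destroyed, copy.inner would point at freed memory, which is the same shape of bug that the Address Sanitizer reports.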

This is what you're seeing here: transform_view ends up copying the iterator (as it is certainly allowed to do), and now you have a dangling iterator (libc++'s implementation moves instead, which is why it happens to work in this case, as 康桓瑋 pointed out). But we can reproduce the same issue without transform as long as we copy the iterator and destroy the original. For instance:

#include <ranges>
#include <regex>
#include <iostream>
#include <optional>
#include <string_view>

int main() {
    std::string_view text = "The IP addresses are: 192.168.0.25 and 127.0.0.1";
    std::regex regex{R"((\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}))"};

    auto a =  std::ranges::subrange(
            std::cregex_iterator(std::ranges::begin(text), std::ranges::end(text), regex),
            std::cregex_iterator{}
        );

    auto b = a | std::views::join;

    std::optional i = b.begin();
    std::cout << std::string_view((*i)->first, (*i)->second) << '\n'; // fine

    auto j = *i;
    i.reset();
    std::cout << std::string_view(j->first, j->second) << '\n'; // boom
}

I'm not sure what a solution to this problem would look like, but the cause is the std::regex_iterator and not the views::join or the views::transform.
