Why does std::views::split() compile but not split with an unnamed string literal as a pattern?

Time:11-01

When std::views::split() is given an unnamed string literal as the pattern, the code compiles but the string is not split. With an unnamed character literal as the pattern, it works just fine.

#include <iomanip>
#include <iostream>
#include <ranges>
#include <string>
#include <string_view>

int main(void)
{
    using namespace std::literals;

    // returns the original string (not splitted)
    auto splittedWords1 = std::views::split("one:.:two:.:three", ":.:");
    for (const auto word : splittedWords1)
        std::cout << std::quoted(std::string_view(word));
    
    std::cout << std::endl;

    // returns the splitted string
    auto splittedWords2 = std::views::split("one:.:two:.:three", ":.:"sv);
    for (const auto word : splittedWords2)
        std::cout << std::quoted(std::string_view(word));
    
    std::cout << std::endl;

    // returns the splitted string
    auto splittedWords3 = std::views::split("one:two:three", ':');
    for (const auto word : splittedWords3)
        std::cout << std::quoted(std::string_view(word));
    
    std::cout << std::endl;

    // returns the original string (not splitted)
    auto splittedWords4 = std::views::split("one:two:three", ":");
    for (const auto word : splittedWords4)
        std::cout << std::quoted(std::string_view(word));
    
    std::cout << std::endl;

    return 0;
}

See live @ godbolt.org.

I understand that string literals are always lvalues. Even so, I am missing some important piece of information that connects everything together. Why can I pass the string that I want to split as an unnamed string literal, whereas it fails (as in: returns a range of ranges containing the original string) when I do the same with the pattern?

CodePudding user response:

String literals always end with a null terminator, so ":.:" is actually a range of four elements (a const char[4]) whose last element is '\0'.

Since the original string does not contain that four-character pattern, it is not split.

When dealing with C++20 ranges, I strongly recommend using string_view instead of raw string literals. It works well with <ranges> and avoids the error-prone null-terminator issue.

CodePudding user response:

The answer above is completely correct. I'd just like to add a couple of additional notes that might be interesting.


First, if you use {fmt} for printing, it's a lot easier to see what's going on, since you also don't have to write your own loop. You can just write this:

fmt::print("{}\n", std::views::split("one:.:two:.:three", ":.:"));

Which will output (this is the default output for a range of range of char):

[[o, n, e, :, ., :, t, w, o, :, ., :, t, h, r, e, e, ]]

In C++23, there will be a way to directly specify that this print as a range of strings, but that hasn't been added to {fmt} yet. In the meantime, because split preserves the initial range category, you can add:

auto to_string_views = std::views::transform([](auto sr){
    return std::string_view(sr.data(), sr.size());
});

And then:

fmt::print("{}\n", std::views::split("one:.:two:.:three", ":.:") | to_string_views);

prints:

["one:.:two:.:three\x00"]

Note the visibly trailing zero. Likewise, the next three attempts format as:

["one", "two", "three\x00"]
["one", "two", "three\x00"]
["one:two:three\x00"]

The fact that we can clearly see the \x00 helps track down the issue.


Next, consider the difference between:

std::views::split("one:.:two:.:three", ":.:")

and

"one:.:two:.:three" | std::views::split(":.:")

We typically consider these to be equivalent, but they're... not entirely. In the latter case, the library has to capture and stash the pattern until the range arrives, which involves decaying it. Because ":.:" decays into char const*, which is not a range, it is no longer a valid pattern for the incoming string literal. So the pipe form doesn't actually compile.

Now, it'd be great if it both compiled and also worked correctly. Unfortunately, it's impossible to tell in the language between a string literal (where you don't want to include the null terminator) and an array of char (where you want to include the whole array). So at least, with this latter formulation, you can get the wrong thing to not compile. And at least - "doesn't compile" is better than "compiles and does something wildly different from what I expected"?


Demo.
