C++ Regex not returning intended pattern

Time:11-13

I have a string that is received over Ethernet and I want to convert the string to std::float_t data.
I'm using std::regex_replace to strip out every character that is not a digit, a decimal point, or a sign (+, -), replacing each one with an empty string.
When I execute the std::regex_replace, it only returns digits.
Code

std::string::size_type  sz;
const std::regex        e("[(\D)|(^+-.)]");   // matches not digit, decimal point (.), plus sign, minus sign
std::string             numeric_string = std::regex_replace(temporary_recieve_data, e, "");
std::float_t            value = std::stof(numeric_string, &sz);

CodePudding user response:

As in many languages, backslashes in C++ string literals are interpreted by the compiler, so by the time the program actually runs, each escape sequence has already been replaced by whatever the C++ language defines for it, or by whatever the compiler does with other escape sequences it happens to accept.

The C++ language only defines a handful of escape sequences, and \D is not one of them. That shouldn't be a surprise, because \D is only meaningful as part of a pattern given to the std::regex constructor, which is a function executed at runtime. So for that constructor to see a regex special \-sequence, the backslash itself has to be present in the string passed to the constructor.

Perhaps this will make that a bit clearer:

#include <iostream>
#include <string>
int main() {
  // Here is a string with a \D escape:
  std::string escaped_D("\D");
  std::cout << "escaped_D produces the string '"
            << escaped_D
            << "'\n";

  // Now, we'll do that again the right way
  std::string escaped_escape_and_D("\\D");
  std::cout << "escaped_escape_and_D produces the string '"
            << escaped_escape_and_D
            << "'\n";
}

If we compile that and run it, we'll see:

escaped_D produces the string 'D'
escaped_escape_and_D produces the string '\D'

To recap, the string literal "\D" produces a string containing exactly one character, an upper-case D. The string literal "\\D" produces a string containing two characters, first a backslash and then the uppercase D. For the regex library to see \D, you'll need the second version.

GCC is also kind enough to warn you that \D is not a valid escape sequence in a C++ string literal. When I compiled that program, the compiler reported:

<stdin>: In function ‘int main()’:
<stdin>:5:25: warning: unknown escape sequence: '\D'

That is also useful to know. GCC replaces any unknown escape sequence with the character following the backslash, so that \D becomes D, as we just saw. But that is specific to GCC. Another compiler might do something else, or even refuse to compile the program. \\ is a valid escape sequence, for which the C++ standard requires the compiler to substitute a single backslash. So the second string in the above program is fine, and does not produce a warning.

OK, let's move on to the regular expression. Your regular expression is a "bracket expression", usually called a character class, which is a set of characters. The character class matches exactly one character, provided that character is in the set. The precise semantics of regular expressions vary, but the particular flavour you are using (the C++ default regex syntax, with GCC extensions) does allow backslash-escape sequences that represent sets of characters to be included in a character class. Inside a character class, however, most regex operators are not special. Parentheses and vertical bars just represent themselves. Apart from the backslash, the closing ] and a ^ in first position, the only special character is a dash (-), which is used to represent a range of characters. (Only inside a character class, though. Outside of the character class, dash just represents itself.)

So, here's your regular expression as a C++ string literal:

"[(\D)|(^+-.)]"

As we've seen, the intention was actually:

"[(\\D)|(^+-.)]"

So, what does that mean to the regex library? Simple: it means a character class, whose members are:

  • the character (
  • the set of characters consisting of everything other than digits (\D)
  • the character )
  • the character |
  • the character ( (again, but that's OK. It's a set of characters and you're allowed to write them twice.)
  • the character ^ (this would be special if it were at the beginning of the character class. But it's not.)
  • the range of characters +-. consisting of all characters whose ASCII codes are between + and ., inclusive. + is 0x2B and . is 0x2E; the range also includes the codes in between, which are , (0x2C) and - (0x2D). So, by coincidence, it almost means what you thought you wrote, except that it also includes a comma.
  • the character ), another repeat.

Of course, since the character set includes all non-digits -- \D -- those other characters are all redundant. They are all non-digits, so you could have replaced the entire regular expression with just \D. But that's not what you wanted; you wanted anything other than a digit, +, - or .. So to get that, you need something like this regular expression:

"[^\\d.+-]"  // This is really the regex [^\d.+-]

That one uses ^ as the first symbol in the character class, which means that the entire character class is inverted. So the list is the set of characters you want the pattern not to match, which is why I used \d (the set of digits). I also took care to put the minus sign at the end of the character class, so that it doesn't look like a range of characters. I could also have put it at the beginning, right after the ^, or I could have escaped it with a \\, but it's usually easiest to put it at the end. I could also have written the digits out as a range, which might be easier to follow:

"[^0-9.+-]"   // This means exactly the same as the previous one.

Now, we can try using that:

#include <iostream>
#include <regex>
#include <string>
int main(int argc, char* argv[]) {
  const std::regex nondigits("[^\\d.+-]");
  std::string text("PI, to five decimals, is 3.14159");
  std::string number = std::regex_replace(text, nondigits, "");
  std::cout << text << " -> " << number << '\n';
  return 0;
}

That compiles without any warnings, and produces the expected output:

PI, to five decimals, is 3.14159 -> 3.14159

CodePudding user response:

  1. Inside square brackets [], you can't use "(" and ")" as a match group or "|" as an OR operator. If you use them there, each is interpreted as a literal character.
  2. Inside square brackets [], "-" is interpreted as a range operator (as in a-z). To use it as a literal character, put it first in the brackets (directly after a leading ^, if there is one) or last.
  3. You can use a raw string literal, R"( ... )", so that backslashes don't need to be doubled.

As a result, you can write it like this and it works well.

const std::regex        e(R"([^-+.\d])");