Regex expression doesn't recognize dot at end of word - Regex (C++)

Time:01-17

I'm trying to read a line out of a file using the following regex expression:

^([A-z.]+?\\s?[A-z]+)\\s([A-z]+)\\s(\\d{7})\\s(\\d?\\d.\\d)$

on the line:

W.W. Sneijder 0000574 10.0

(To be clear: the intent is to make any word with chars [a-z], [A-Z], or dots, match with the [A-z.] part.)

However, the regular expression doesn't recognize the second dot in W.W., which seems strange to me. Don't the square brackets combined with the + mean that any character from inside them is accepted, until (here) whitespace is encountered? I found a regex that does work but isn't that elegant:

^([A-z.]+[.\\s?[A-z]+)\\s([A-z]+)\\s(\\d{7})\\s(\\d?\\d.\\d)$

I'm hoping to find an elegant solution. It'd be great to hear your input.

Links such as RegEx - Not parsing dot(.) at the end of a sentence didn't seem to answer my question unfortunately.

CodePudding user response:

Space-separated data is just a variant of the common CSV (Comma Separated Values) format. There are many ways to split a string on arbitrary separators, but in C++ splitting on spaces is actually very easy:

#include <algorithm>  // std::copy
#include <iterator>   // std::istream_iterator, std::back_inserter
#include <sstream>    // std::istringstream
#include <string>
#include <vector>

std::vector<std::string> separate_on_space(std::string const& input)
{
    std::vector<std::string> output;
    std::istringstream iss(input);

    // Copy all space-separated "words" from the input to the vector
    std::copy(std::istream_iterator<std::string>(iss), // Begin iterator
              std::istream_iterator<std::string>(),    // End iterator
              std::back_inserter(output));             // Destination iterator

    return output;
}


Once you have separated the values into a vector of strings, you can then convert the numeric values to their actual type (for example using std::stod) and store them in suitable objects.


Of course this doesn't handle names with spaces in them gracefully, but that can be handled at a higher level (by checking the size of the resulting vector, and by knowing that the last two elements should always be the two numbers and the rest are the name).
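
The size check described above could be sketched roughly as follows. This is only an illustration under a few assumptions: the Record struct, its field names and the parse_record function are hypothetical (nothing like them appears in the question), the identifier is kept as text so that its leading zeros survive, and a line is rejected unless its last two tokens look like a 7-digit number and a floating-point value.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type; the field names are illustrative only.
struct Record {
    std::string name;  // every token before the last two, joined by spaces
    std::string id;    // kept as text so leading zeros are preserved
    double score = 0.0;
};

// Returns true and fills `out` when the line has at least one name token
// followed by a 7-digit id and a number; returns false otherwise.
bool parse_record(std::string const& line, Record& out)
{
    std::istringstream iss(line);
    std::vector<std::string> tokens;
    for (std::string word; iss >> word; )
        tokens.push_back(word);

    // Need at least one name token plus the two trailing numbers.
    if (tokens.size() < 3)
        return false;

    std::string const& id = tokens[tokens.size() - 2];
    bool const all_digits = std::all_of(id.begin(), id.end(),
        [](unsigned char c) { return std::isdigit(c) != 0; });
    if (id.size() != 7 || !all_digits)
        return false;

    double score = 0.0;
    try {
        std::size_t used = 0;
        score = std::stod(tokens.back(), &used);
        if (used != tokens.back().size())
            return false;  // trailing junk such as "10.0x"
    } catch (std::exception const&) {
        return false;      // not a number at all, or out of range
    }

    // Everything before the last two tokens is the name.
    std::string name = tokens[0];
    for (std::size_t i = 1; i + 2 < tokens.size(); ++i)
        name += ' ' + tokens[i];

    out.name = name;
    out.id = id;
    out.score = score;
    return true;
}
```

Calling parse_record("W.W. Sneijder 0000574 10.0", rec) should leave "W.W. Sneijder" in rec.name, and a name made of three or more words is handled the same way because only the last two tokens are treated as numbers.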

On the other hand the regular expression in the question doesn't handle it at all. :)

CodePudding user response:

In your regex, the entire W.W. Sneijder is captured in the first group. Looking at your regex, I doubt you intended it that way.

I think the regex you wanted is ^([A-z.]+?\s?[A-z]+)\s(\d{7})\s(\d?\d.\d)$.
Or if you wanted Sneijder to be in the second capture: ^([A-z.]+?)\s([A-z]+)\s(\d{7})\s(\d?\d.\d)$.

... or maybe you wanted ^([A-z.]+?\s?[A-z]*)\s([A-z]+)\s(\d{7})\s(\d?\d.\d)$ (* instead of + in the first capture group).
or ^([A-z.]+?(?:\s[A-z]+)?)\s([A-z]+)\s(\d{7})\s(\d?\d.\d)$ (optional space + text, again in the first capture group).

All 4 expressions should match your test string, but behave differently on other test strings.
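
One way to check this claim is a small std::regex sketch. The patterns are written here with + quantifiers, which is an assumption about how they were meant to read, and the names kPatterns and capture_groups are hypothetical helpers introduced only for this example.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// The four patterns from this answer, as raw string literals so the
// backslashes need no extra escaping.
char const* const kPatterns[] = {
    R"(^([A-z.]+?\s?[A-z]+)\s(\d{7})\s(\d?\d.\d)$)",
    R"(^([A-z.]+?)\s([A-z]+)\s(\d{7})\s(\d?\d.\d)$)",
    R"(^([A-z.]+?\s?[A-z]*)\s([A-z]+)\s(\d{7})\s(\d?\d.\d)$)",
    R"(^([A-z.]+?(?:\s[A-z]+)?)\s([A-z]+)\s(\d{7})\s(\d?\d.\d)$)",
};

// Returns the capture groups (without the whole-match entry) when the
// pattern matches the entire line, or an empty vector when it does not.
std::vector<std::string> capture_groups(std::string const& pattern,
                                        std::string const& line)
{
    std::regex const re(pattern);  // ECMAScript grammar by default
    std::smatch m;
    std::vector<std::string> groups;
    if (std::regex_match(line, m, re))
        for (std::size_t i = 1; i < m.size(); ++i)
            groups.push_back(m[i].str());
    return groups;
}
```

For the test string, the first pattern should yield three groups with the whole name "W.W. Sneijder" in the first one, while the other three should yield four groups with "W.W." and "Sneijder" separated. The pattern from the question yields no match at all: its first group must end in a letter, so it cannot stop at "W.W.", and once it swallows "Sneijder" nothing is left for the second group.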


There certainly are improvements to be made to the regex, such as ensuring the string does not start with a dot (.).
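
As a hedged sketch of what such a tightening might look like: the pattern below is not from the question or this answer, and the function name matches_strict is hypothetical. It makes three changes: the first group must start with a letter, [A-z] is replaced by [A-Za-z] (in ASCII the range A-z also covers [, \, ], ^, _ and the backtick), and the dot in the last number is escaped so that it no longer matches any character.

```cpp
#include <regex>
#include <string>

// Stricter variant of the four-group pattern (an illustration, not the
// original author's regex):
//  - group 1 starts with a letter, so a leading dot is rejected
//  - [A-Za-z] replaces [A-z], which also accepts [ \ ] ^ _ and `
//  - \. requires a literal dot in the final number
bool matches_strict(std::string const& line)
{
    static std::regex const re(
        R"(^([A-Za-z][A-Za-z.]*)\s([A-Za-z]+)\s(\d{7})\s(\d?\d\.\d)$)");
    return std::regex_match(line, re);
}
```

The capture groups keep the same order and meaning, so code that reads groups 1 to 4 after the match would not need to change.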

As long as you only touch the inside of each capture group and not the logic across capture groups, you can let the regex enforce any level of control you desire, and this will have no impact on the code that follows the text parsing.
There will always be 4 capture groups (except for the first regex I posted above, which has only 3), each with some guarantees on the text if you need to convert it to another type.
