Is this fscanf behavior inconsistent?

Typically fscanf, when scanning a non-integer using %d, will fail until the non-integer characters are explicitly removed from the input stream. Trying to scan a123 fails, until the a is removed from the input stream.

Trying to scan ------123 fails (fscanf returns 0) but the - is removed from the input stream.

Is this correct behavior for fscanf?

The file contains ----------123 and the result of this code:

#include <stdio.h>

int main(void) {
    int number = 0;
    int result = 0;
    FILE *pf = NULL;

    if (NULL != (pf = fopen("integer.txt", "r"))) {
        while (1) {
            if (1 == (result = fscanf(pf, "%d", &number))) {
                printf("%d\n", number);
            } else {
                if (EOF == result) {
                    break;
                }
                printf("result is %d\n", result);
            }
        }
        fclose(pf);
    }
    return 0;
}

is:

result is 0
result is 0
result is 0
result is 0
result is 0
result is 0
result is 0
result is 0
result is 0
-123

If the file contains a123 the result is an infinite loop.

That seems to me to be inconsistent behavior. No?

CodePudding user response：

The point here is not one of inconsistency, but one of the many limitations of the *scanf() family.

The standard is very specific on how *scanf() parses input. Characters are taken from input one by one, and checked against the format string. If they match, the next character is taken from input. If they don't match, the character is "put back", and the conversion fails.

But only that last character read is ever put back.

(This is made explicit in C11 footnote 285. And it actually has nothing to do with the one byte of push-back that ungetc() guarantees -- because a library function may not call ungetc() -- that byte of push-back is reserved for the user.)

This allows libraries to cram that one byte of "put back" input somewhere, instead of having to have some kind of buffer that would be large enough for all kinds of eventualities.

It also makes *scanf() fail in the middle of certain character sequences, without actually retracing to where it began its conversion attempt.

In your case, "--123" read as "%d":

Taking the first '-'. Sign. All is well, continue.
Taking the second '-'. Matching error.
Put back the second '-' (the last character read). The first '-' cannot be put back as well, as per above.
Return 0 (conversion failed).

This is (one of) the reason(s) why you should not ever use *scanf() on potentially malformed input: The scan can fail without you knowing where exactly it failed.

It's also a murky corner of the standard that was not actually implemented correctly in a number of mainstream library implementations last time I checked. ;-)

Other reasons include, but are not limited to, numerical overflow, which is not handled gracefully at all: if the converted value cannot be represented in the target object, the behavior is undefined. Hence the usual recommendation is to read full lines of input with fgets(), then parse the line in-memory using strtol(), strtod() etc., which can and will handle things like the above in a well-defined way.

CodePudding user response：

Is this correct behavior for fscanf?

Yes, it is. As @stark pointed out in the comments, a leading - is part of the number when you use %d as the format specifier.

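If you want to scan a positive integer (digits only), you can put a scanset in front of %d to discard all non-digits. Note that %*[^0-9] must match at least one character: if the input already starts with a digit, that directive fails and fscanf returns 0 without attempting %d.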

fscanf(pf, "%*[^0-9]%d", &number)

CodePudding user response：

This behavior is specified:

Here are the relevant paragraphs from the C2x Standard:

7.21.6.2 The fscanf function

[...]

7   A directive that is a conversion specification defines a set of matching input sequences, as described below for each specifier. A conversion specification is executed in the following steps:
8   Input white-space characters are skipped, unless the specification includes a [, c, or n specifier.
9   An input item is read from the stream, unless the specification includes an n specifier. An input item is defined as the longest sequence of input characters which does not exceed any specified field width and which is, or is a prefix of, a matching input sequence. 310) The first character, if any, after the input item remains unread. If the length of the input item is zero, the execution of the directive fails; this condition is a matching failure unless end-of-file, an encoding error, or a read error prevented input from the stream, in which case it is an input failure.
10   Except in the case of a % specifier, the input item (or, in the case of a %n directive, the count of input characters) is converted to a type appropriate to the conversion specifier. If the input item is not a matching sequence, the execution of the directive fails: this condition is a matching failure. Unless assignment suppression was indicated by a *, the result of the conversion is placed in the object pointed to by the first argument following the format argument that has not already received a conversion result. If this object does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behavior is undefined.

310) fscanf pushes back at most one input character onto the input stream. Therefore, some sequences that are acceptable to strtod, strtol, etc., are unacceptable to fscanf.

In your example, the initial - is a prefix of a matching input sequence, and the next character, another -, does not match, so it remains in the input stream. The input item, -, is not itself a matching sequence, so you get a matching failure and fscanf returns 0, but the first - has already been consumed.

This behavior is observed on Linux with the GNU C library (glibc), but not on macOS with Apple Libc, where the initial dash is not consumed.