Gnu regex fails if `size` parameter of getline is not reset to 0

Time:07-20

The code below consists of read_files(), which reads a bunch of text files, and match(), which matches a string against a pattern using the GNU regex library.

Inside read_files() I call getline() with the size argument set to 0, so that getline() starts with its default buffer size of 120 and grows the buffer as needed.

#include <limits.h> // for PATH_MAX
#include <regex.h>  // for regcomp, regerror, regexec, regfree, size_t, REG...
#include <stdio.h>  // for printf, fprintf, NULL, fclose, fopen, getline
#include <stdlib.h> // for exit, free, EXIT_FAILURE

int match(const char *regex_str, const char *str) {

    regex_t regex;
    int reti;
    char msgbuf[100];

    /* Compile regular expression */
    reti = regcomp(&regex, regex_str, REG_EXTENDED);
    if (reti) {
        fprintf(stderr, "Could not compile regex\n");
        exit(1);
    }

    /* Execute regular expression */
    reti = regexec(&regex, str, 0, NULL, 0);
    if (!reti) {
        return 1;
    } else if (reti == REG_NOMATCH) {
        return 0;
    } else {
        regerror(reti, &regex, msgbuf, sizeof(msgbuf));
        fprintf(stderr, "Regex match failed: %s\n", msgbuf);
        exit(1);
    }

    /* Free memory allocated to the pattern buffer by regcomp() */
    regfree(&regex);
}

void read_files() {

    size_t path_count = 2;
    char pathnames[2][PATH_MAX] = {"./tmp/test0.conf", "./tmp/test1.conf"};

    FILE *fp;
    char *line = NULL;
    size_t len = 0;
    ssize_t read_count;

    for (int i = 0; i < path_count; i++) {
        printf("opening file %s\n", pathnames[i]);

        fp = fopen(pathnames[i], "r");
        if (fp == NULL) {
            printf("internal error,couldn't open file %s\"}", pathnames[i]);
            exit(EXIT_FAILURE);
        }
        int linenum=1;
        while ((read_count = getline(&line, &len, fp)) != -1) {
            printf("%d: %s",linenum,line);
            linenum++;
        }
        printf("len: %zu\n", len);

        fclose(fp);
        // len=0; // this is the line that fixes the bug, if i reset len to 0 after reading the first file then everything works as expected, if i don't reset it then regex matching fails
        if (line)
            free(line);
    }
}

int main(int argc, char *argv[]) {
    read_files();

    if (!match("^[a-zA-Z0-9]+$", "jack")) {
        printf("input don't match\n");
    }
}

the content of test0.conf

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

the content of test1.conf

testing123

when running the above code i get this output:

opening file ./tmp/test0.conf
1: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
len: 240
opening file ./tmp/test1.conf
1: testing123
len: 240
input don't match

So the pattern matching fails for the string "jack", which should match.

You can see that len is 240 after the first file has been read, so when getline() runs again for the second file it reads with a buffer size of 240, but for some reason this causes the regex matching to fail.

If I reset len to 0 after reading the first file, the code works as expected (the regex matching works fine).

So why does the getline() len parameter affect the behavior of the gnu regex?

User response:

So why does the getline() len parameter affect the behavior of the gnu regex?

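As Marian commented, you are using getline incorrectly, causing it to corrupt the heap. After free(line) at the end of the first iteration, line still points at the freed buffer and len is still 240, so the next getline call treats that freed memory as a valid 240-byte buffer and writes into it. You can observe this by compiling the program with the -fsanitize=address flag and running it. See the Address Sanitizer manual to understand the error.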

This is undefined behavior, and your program can do anything. Here it just happens to cause the GNU regex library to stop working correctly. A SIGSEGV is another likely outcome.

To fix the problem, you should move the free call out of the loop and only free the memory after you are done reading the lines.
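
A minimal sketch of that fix, assuming a hypothetical helper count_lines() that is not part of the original program: one buffer is shared by every file, and free() is called exactly once after the loop, so line and len always describe the same live allocation.

```c
#define _POSIX_C_SOURCE 200809L /* for getline */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h> /* for ssize_t */

/* Hypothetical helper (not in the original code): counts the lines in
 * all files, reusing a single getline() buffer. Returns -1 if a file
 * cannot be opened. */
long count_lines(const char *paths[], size_t path_count) {
    char *line = NULL; /* buffer managed by getline() */
    size_t len = 0;    /* capacity of the buffer line points to */
    long total = 0;

    for (size_t i = 0; i < path_count; i++) {
        FILE *fp = fopen(paths[i], "r");
        if (fp == NULL) {
            free(line); /* free(NULL) is a no-op, so this is always safe */
            return -1;
        }
        while (getline(&line, &len, fp) != -1) {
            total++;
        }
        fclose(fp);
        /* No free() here: the buffer is reused for the next file. */
    }

    free(line); /* free once, after the last file has been read */
    return total;
}
```

Because the buffer survives between files, the second file is read into the allocation that already grew while reading the first one.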

Setting line = NULL in the loop after you free it is another possible (but less efficient) fix.
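
This second variant could look roughly like the sketch below, again using a hypothetical helper name (count_lines_fresh_buffer). It keeps free() inside the loop but resets both line and len afterwards, so getline never sees a dangling pointer. The price is a fresh allocation for every file.

```c
#define _POSIX_C_SOURCE 200809L /* for getline */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h> /* for ssize_t */

/* Hypothetical helper (not in the original code): counts the lines in
 * all files, releasing the getline() buffer after each file. Returns -1
 * if a file cannot be opened. */
long count_lines_fresh_buffer(const char *paths[], size_t path_count) {
    char *line = NULL;
    size_t len = 0;
    long total = 0;

    for (size_t i = 0; i < path_count; i++) {
        FILE *fp = fopen(paths[i], "r");
        if (fp == NULL) {
            return -1; /* line is NULL at this point, nothing to free */
        }
        while (getline(&line, &len, fp) != -1) {
            total++;
        }
        fclose(fp);

        free(line);
        line = NULL; /* tell the next getline() call to allocate afresh */
        len = 0;     /* keep the size consistent with the NULL pointer */
    }
    return total;
}
```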
