C: Time-efficient way to get value of second column in text file

Time:07-19

I am parsing a simple text file with two columns in C.

The two columns are separated by a tab. While I need the whole line at a later stage, I also have to extract the value in the second column.

My implementation of this part so far (reading a gzipped file):

while (! gzeof(fp)) {

   // here I keep the whole line since I need it later (can I do this also faster?)
   strcpy(line_save, line);

   // get the value in the second column (first removing the newline char.):
   line[strcspn(line, "\n")] = 0;
   linkage = strtok(line,"\t");
   linkage = strtok(NULL,"\t"); // here I have the value in the second col. as the result

   // do stuff

   gzgets(fp, line, LL);
}

What is a more time-efficient way to do this?

I am reading a gzipped file. gzeof() checks if EOF is reached and gzgets() reads one line.

I am not looking for an overly advanced solution here; I am mainly interested in the "low-hanging fruit". However, if you can present more advanced solutions, I do not mind.

CodePudding user response:

I'm assuming that gzgets() behaves in a similar way to fgets():

ZEXTERN char * ZEXPORT gzgets OF((gzFile file, char *buf, int len));

Reads bytes from the compressed file until len-1 characters are read, or a newline character is read and transferred to buf, or an end-of-file condition is encountered. If any characters are read or if len == 1, the string is terminated with a null character. If no characters are read due to an end-of-file or len < 1, then the buffer is left untouched.

gzgets returns buf which is a null-terminated string, or it returns NULL for end-of-file or in case of error. If there was an error, the contents at buf are indeterminate.

char line[128]; // Extend as you see fit
while (gzgets(gzfile, line, sizeof(line))) {
    line[strcspn(line, "\n")] = '\0';
    
    char col1[64], col2[64];
    if (sscanf(line, " %63s\t%63[^\n]", col1, col2) != 2) {
        // Error while parsing the line
        puts("Error");
        continue; // col1 and col2 are not valid for this line
    }
    
    // Testing
    printf("col1: '%s'\ncol2: '%s'\n", col1, col2);
    
    // And line is untouched.
}

Edit: The below version should run slightly faster than the one above:

  • The call to strcspn() is removed.
  • The for-loop stops when a \t is met, so this avoids scanning the entire string.

char line[128]; // Extend as you see fit
while (gzgets(gzfile, line, sizeof(line))) {
    // Each column is assumed to fit in 63 characters
    char col1[64] = "", col2[64] = "";
    for (char *p = line; *p != '\0' && *p != '\n'; ++p) {
        if (*p == '\t') {
            strncpy(col1, line, p - line);
            col1[p - line] = '\0'; // strncpy() does not terminate the copy
            strcpy(col2, p + 1);   // col2 keeps the trailing '\n', if any
            break;
        }
    }
    
    // Testing
    printf("col1: '%s'\ncol2: '%s'\n", col1, col2);
    
    // And line is untouched.
}
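
The idea of stopping at the first tab can also be sketched without copying either column. The helper below is hypothetical (second_column is not part of the question's code or of zlib): it only returns a pointer into the existing line plus a length, assuming a line with no tab should be treated as "no second column".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to the start of the second
 * column inside `line` and stores its length in *len. The length stops
 * at a trailing '\n', '\r' or a further tab. Returns NULL if the line
 * contains no tab. `line` itself is never modified. */
const char *second_column(const char *line, size_t *len)
{
    const char *tab = strchr(line, '\t');   /* stops at the first tab */
    if (tab == NULL)
        return NULL;

    const char *col2 = tab + 1;
    *len = strcspn(col2, "\t\r\n");         /* scans only the second column */
    return col2;
}
```

Since the result is not null-terminated, a caller could print it with printf("%.*s\n", (int)len, col2), or copy len bytes into its own buffer with memcpy() and terminate that copy. Because line is left intact, a separate line_save copy may not be needed at all.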

CodePudding user response:

Try the following code. BTW, you probably do not need to create a copy of line in line_save, as this code does not destroy the original line. If that is the case, you can break out of the for-loop as soon as t2 is set.

while (! gzeof(fp)) {
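    int i, t1, t2;
    
    // t1: index of the tab, t2: index of the end of the second column
    t1 = t2 = -1;
    for (i = 0; line[i] != 0; i++) {
        line_save[i] = line[i];
        if (t1 < 0) {
            if (line[i] == '\t') t1 = i;
        } else if (t2 < 0 && (line[i] == '\t' || line[i] == '\n')) {
            t2 = i;
        }
    }
    line_save[i] = 0;
    if (t1 >= 0 && t2 < 0) t2 = i; // last line without a trailing newline

    if (t2 >= 0) {
        char end = line[t2]; // '\n', '\t' or 0
        line[t2] = 0;
        linkage = &line[t1 + 1];
        // do what you need with 'linkage'

        // reconstruct the original line
        line[t2] = end;
    }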

    // do other stuff with 'line'

    gzgets(fp, line, LL);
}