I am parsing a simple text file with two columns in C.
The two columns are separated by a tab. I need the whole line at a later stage, but I also have to extract the value in the second column.
My implementation of this part so far (reading a gzipped file) is:
while (! gzeof(fp)) {
    // here I keep the whole line since I need it later (can I do this also faster?)
    strcpy(line_save, line);
    // get the value in the second column (first removing the newline char.):
    line[strcspn(line, "\n")] = 0;
    linkage = strtok(line,"\t");
    linkage = strtok(NULL,"\t"); // here I have the value in the second col. as the result
    // do stuff
    gzgets(fp, line, LL);
}
What is a more time-efficient way to do this?
I am reading a gzipped file. gzeof() checks if EOF is reached and gzgets() reads one line.
I am not looking for an overly advanced solution here; I am mainly interested in the "low-hanging fruit". However, if you can present more advanced solutions, I do not mind.
CodePudding user response:
I'm assuming that gzgets() behaves in a similar way to fgets():
ZEXTERN char * ZEXPORT gzgets OF((gzFile file, char *buf, int len));

Reads bytes from the compressed file until len-1 characters are read, or a newline character is read and transferred to buf, or an end-of-file condition is encountered. If any characters are read or if len == 1, the string is terminated with a null character. If no characters are read due to an end-of-file or len < 1, then the buffer is left untouched.

gzgets returns buf which is a null-terminated string, or it returns NULL for end-of-file or in case of error. If there was an error, the contents at buf are indeterminate.
char line[128]; // Extend as you see fit
while (gzgets(gzfile, line, sizeof(line))) {
    line[strcspn(line, "\n")] = '\0';
    char col1[64], col2[64];
    // The field widths (63) keep sscanf from overflowing col1 and col2
    if (sscanf(line, " %63s\t%63[^\n]", col1, col2) != 2) {
        // Error while parsing the line
        puts("Error");
        continue; // col1/col2 are not valid for this line
    }
    // Testing
    printf("col1: '%s'\ncol2: '%s'\n", col1, col2);
    // And line is untouched (apart from the removed newline).
}
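One consequence of the gzgets() contract quoted above: with a fixed buffer such as line[128], a line longer than len-1 characters arrives in several pieces. A minimal sketch of a check for that case follows. The helper name line_is_complete is hypothetical and not part of zlib.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not part of zlib): returns true when buf holds a
 * whole line, i.e. the chunk returned by gzgets()/fgets() ends in '\n'.
 * A chunk without a trailing newline is either a line that did not fit
 * into the buffer or the last line of a file lacking a final newline. */
static bool line_is_complete(const char *buf)
{
    assert(buf != NULL);
    size_t n = strlen(buf);
    return n > 0 && buf[n - 1] == '\n';
}
```

Calling it right after each successful gzgets() lets you decide whether to enlarge the buffer or to treat the chunk as a partial line before splitting it into columns.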
Edit: The version below should run slightly faster than the one above:
- Removed the call to strcspn().
- The for-loop stops when a \t is met, so this avoids scanning the entire string.
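char line[128]; // Extend as you see fit
while (gzgets(gzfile, line, sizeof(line))) {
    // Assumes each column is shorter than 64 characters
    char col1[64] = "", col2[64] = "";
    for (char *p = line; *p != '\0' && *p != '\n'; p++) {
        if (*p == '\t') {
            strncpy(col1, line, p - line);
            col1[p - line] = '\0'; // strncpy does not terminate the string here
            strcpy(col2, p + 1);   // col2 keeps the trailing newline, if any
            break;
        }
    }
    // Testing
    printf("col1: '%s'\ncol2: '%s'\n", col1, col2);
    // And line is untouched.
}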
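If the second column is only read and never modified, the copies into col1 and col2 can be skipped altogether. The following is a minimal sketch of that idea using strchr(). The helper name second_column is hypothetical, and treating a further tab, '\r', or '\n' as the end of the column is an assumption about the file format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to the start of the second column
 * inside line (no copy is made) and stores its length, excluding any
 * trailing newline, in *len. Returns NULL if the line contains no tab. */
static const char *second_column(const char *line, size_t *len)
{
    assert(line != NULL && len != NULL);
    const char *tab = strchr(line, '\t');
    if (tab == NULL)
        return NULL;
    const char *start = tab + 1;
    /* Assumption: the column ends at a further tab or at the line end. */
    *len = strcspn(start, "\t\r\n");
    return start;
}
```

Because the result is a pointer plus a length rather than a terminated string, it can be printed with printf("%.*s", (int)len, ptr) or compared with strncmp(), and line itself stays intact for later use.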
CodePudding user response:
Try the following code. By the way, you probably do not need to create a copy of line in line_save, as this code does not destroy the original line.
If that is the case, you can break out of the inner loop after setting t2.
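while (! gzeof(fp)) {
    int i, t1, t2;
    t1 = t2 = -1;
    for (i = 0; line[i] != 0; i++) {
        line_save[i] = line[i];
        // t1: first tab; t2: end of the second column (next tab or newline)
        if (t1 < 0) {
            if (line[i] == '\t') t1 = i;
        } else if (t2 < 0 && (line[i] == '\t' || line[i] == '\n')) {
            t2 = i;
        }
    }
    line_save[i] = 0;
    if (t1 >= 0 && t2 < 0) t2 = i; // last line without a trailing newline
    if (t2 >= 0) {
        char end = line[t2]; // remember the character we overwrite
        line[t2] = 0;
        linkage = &line[t1 + 1];
        // do what you need with 'linkage'
        // reconstruct the original line
        line[t2] = end;
    }
    // do other stuff with 'line'
    gzgets(fp, line, LL);
}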
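The byte-by-byte loop above can also be expressed with library routines, which many C libraries implement with optimized code. Whether that is actually faster for your data is an assumption you should verify by measuring. A minimal sketch follows; the helper name save_and_locate and its cap parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies line (including its terminator) into save,
 * which must have room for at least cap bytes, and returns a pointer to
 * the second column inside the copy. Returns NULL if the line contains
 * no tab or does not fit into save. */
static const char *save_and_locate(const char *line, char *save, size_t cap)
{
    assert(line != NULL && save != NULL);
    size_t n = strlen(line);        /* one pass to get the length */
    if (n + 1 > cap)
        return NULL;
    memcpy(save, line, n + 1);      /* copy with a known length */
    const char *tab = memchr(save, '\t', n);
    return tab != NULL ? tab + 1 : NULL;
}
```

The returned pointer still includes the trailing newline, so strip it or stop at it when reading the value.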