I have a piece of software that generates a rather large text file full of information about files in a directory. There are often several thousand files. Each one gets a set of informational entries, looking something like:
number
number
IMPORTANT NUMBER
info
info
info
info
info
These repeat. The text file will have the same eight lines for each file in the directory.
I'm supposed to sort this text file by the IMPORTANT NUMBERs, the values that appear on the 3rd line, then the (3+8)th line, then the (3+8*2)th line, etc.
Currently, I'm reading them into a multidimensional character array, looking like:
[number][number][IMPORTANT NUMBER 1][info][info][info][info][info]
[number][number][IMPORTANT NUMBER 2][info][info][info][info][info]
[number][number][IMPORTANT NUMBER 3][info][info][info][info][info]
[number][number][IMPORTANT NUMBER 4][info][info][info][info][info]
etc.
The idea is to sort each set of 8 entries by the important numbers in ascending order. If, for example, my array looked like this:
[number2][number2][2][info2][info2][info2][info2][info2]
[number3][number3][3][info3][info3][info3][info3][info3]
[number1][number1][1][info1][info1][info1][info1][info1]
[number4][number4][4][info4][info4][info4][info4][info4]
After sorting, it would look like:
[number1][number1][1][info1][info1][info1][info1][info1]
[number2][number2][2][info2][info2][info2][info2][info2]
[number3][number3][3][info3][info3][info3][info3][info3]
[number4][number4][4][info4][info4][info4][info4][info4]
...with the value in arr[2] (1, 2, 3, 4, ...) being used for the sorting.
The issue is that the information stored in the other columns tends to vary in size. arr[3] might have a length of 30 characters, while arr[4] might have a length in excess of 5000. Over several thousand files that adds up quickly, so I don't want to allocate every field at whatever the maximum length might be, especially when most entries would use only a tiny fraction of it.
I've found a lot of good answers on using qsort, but very few on sorting large multidimensional string arrays. I'd also LIKE to use qsort, because I'd prefer not to reinvent the wheel and I doubt anything I'd write would be as efficient.
If anyone could shed some light on how this might be accomplished, I'd appreciate it.
Current code is:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define FIELDS 8

int compare(const void *row1, const void *row2);

int main(int argc, char *argv[])
{
    // (1) - Open File
    const char fname[] = "arrayFile.txt";
    FILE *fp = fopen(fname, "r");
    printf("Opened file: %s\n", fname);

    // (2) - Count Lines
    char cr;
    size_t lines = 0;
    while (cr != EOF)
    {
        if (cr == '\n')
        {
            lines++;
        }
        cr = getc(fp);
    }
    rewind(fp);

    // (3) - Populate Array
    char *data[lines / FIELDS][FIELDS];
    lines = lines / FIELDS;
    size_t n;
    for (int i = 0; i < lines; i++)
    {
        for (int j = 0; j < FIELDS; j++)
        {
            data[i][j] = NULL;
            size_t n = 0;
            getline(&data[i][j], &n, fp);
        }
    }

    // (4) - Print Array Before
    for (int i = 0; i < lines; i++)
    {
        for (int j = 0; j < FIELDS; j++)
        {
            printf("%s", data[i][j]);
        }
    }
    printf("\n\nNot sorted\n\n");

    // (5) - Sort Array
    qsort(data, lines, sizeof(data[0]), compare);
    printf("\n\nsorted\n\n");

    // (6) - Print Array After
    for (int i = 0; i < lines; i++)
    {
        for (int j = 0; j < FIELDS; j++)
        {
            printf("%s", data[i][j]);
            free(data[i][j]);
        }
    }

    // Close File
    fclose(fp);

    printf("\n\nNumber of files: %ld\n", lines);
    printf("\n\nNumber of lines: %ld\n", lines * FIELDS);

    return 0;
}

int compare(const void *row1, const void *row2)
{
    const char *(*a)[8] = row1;
    const char *(*b)[8] = row2;
    return strcmp((*a)[2], (*b)[2]);
}
This, unfortunately (and predictably), produces a segmentation fault during sort time. I'd estimate it's due to how I'm handling the pointers and indexing, but the EXACT reason why is escaping me.
This seems like a really useful thing to know how to do well for the future, but it's a little more than I've personally tried to do with arrays and pointers in C before.
Thanks in advance.
CodePudding user response:
There are multiple problems in your code:
- You should test fopen for potential failure to open the file (see the sketch after this list).

- char cr; should be int cr; to handle all 257 possible values returned by getc(), assuming 8-bit bytes.

- cr is uninitialized during the first iteration of while (cr != EOF). You should write this loop as:

      int cr;
      while ((cr = getc(fp)) != EOF) {
          lines += (cr == '\n');
      }

- As documented by chux, an initial pass to read the whole file is unnecessary; you should instead reallocate the array as you read the file.

- char *data[lines / FIELDS][FIELDS]; might define an array that is too large for automatic storage, causing a stack overflow. A heap allocation avoids this (also shown in the sketch after this list).

- The correct format specifier for size_t is %zu, not %ld. size_t is not a long and might not even have the same size or argument passing convention.

- The compare function uses too many indirections in the type conversion. While your type conversions are probably OK except for const correctness, they are difficult for most programmers to grasp. You should use a simpler approach:

      int compare(const void *row1, const void *row2)
      {
          char * const *a = row1;
          char * const *b = row2;
          return strcmp(a[2], b[2]);
      }

  Note, however, that the above function sorts the important numbers in lexicographical order, placing 11 between 1 and 2. You probably want numeric order instead:

      int compare(const void *row1, const void *row2)
      {
          char * const *a = row1;
          char * const *b = row2;
          long na = strtol(a[2], NULL, 10);
          long nb = strtol(b[2], NULL, 10);
          return (na > nb) - (na < nb);
      }
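To make the first points concrete, here is a minimal, untested sketch of the opening and allocation part only, assuming the same arrayFile.txt file name and FIELDS constant as in the question: the file is opened with an error check, the line-counting loop uses an int, the table lives on the heap instead of in a VLA, and the counts are printed with %zu. The populate/sort/print logic would stay as in the question.

    #include <stdio.h>
    #include <stdlib.h>

    #define FIELDS 8

    int main(void)
    {
        const char fname[] = "arrayFile.txt";   // name taken from the question
        FILE *fp = fopen(fname, "r");
        if (fp == NULL) {                       // test fopen for failure
            perror(fname);
            return EXIT_FAILURE;
        }

        // Count lines with an int so EOF is representable.
        size_t lines = 0;
        int cr;
        while ((cr = getc(fp)) != EOF) {
            lines += (cr == '\n');
        }
        rewind(fp);

        size_t records = lines / FIELDS;

        // Allocate the table on the heap instead of as a VLA,
        // so a very large file cannot overflow the stack.
        char *(*data)[FIELDS] = malloc(records * sizeof *data);
        if (data == NULL) {
            fprintf(stderr, "Out of memory.\n");
            fclose(fp);
            return EXIT_FAILURE;
        }

        // ... populate, qsort, print and free as in the question ...

        printf("Number of files: %zu\n", records);        // %zu for size_t
        printf("Number of lines: %zu\n", records * FIELDS);

        free(data);
        fclose(fp);
        return EXIT_SUCCESS;
    }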
CodePudding user response:
Many concerns - I'll mention only one
Counting lines
Do not count lines. (Drop step 2.) Rather than make 2 passes, use 1 pass and instead adjust data as needed.
Some untested code to give OP an idea:
char *(*data)[FIELDS] = NULL;
size_t records_n = 0; // Allocation total
size_t records_i;     // Allocation used

for (records_i = 0; records_i < SIZE_MAX; records_i++) {
    if (records_i == records_n) {
        size_t records_new_n = records_n * 2 + 1; // Double the allocation
        char *(*newdata)[FIELDS] = realloc(data, sizeof data[0] * records_new_n);
        if (newdata == NULL) {
            free(data);
            fprintf(stderr, "Out of memory.\n");
            exit(EXIT_FAILURE);
        }
        data = newdata;
        records_n = records_new_n;
    }

    int f;
    for (f = 0; f < FIELDS; f++) {
        data[records_i][f] = NULL;
        size_t n = 0;
        ssize_t len = getline(&data[records_i][f], &n, fp);
        if (len == -1) {
            if (f == 0) {
                break;
            }
            fprintf(stderr, "Record ended early.\n");
            break; // Or maybe fail?
        }
        // Lop off potential '\n'
        if (len > 0 && data[records_i][f][len - 1] == '\n') {
            data[records_i][f][len - 1] = '\0';
        }
    }
    if (f < FIELDS) {
        break;
    }
}
// Perhaps right-size data to records_i here? Not shown.

// ... Use data

// When all done, free all lines allocated (not shown) and ...
free(data);
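As a purely illustrative follow-up (not part of the answer above), the two steps marked "not shown" might look roughly like this; the snippet assumes records_i complete records were read and would slot in where the comments sit, just before the final free(data):

    // Right-size the table to the records actually read (optional).
    if (records_i > 0) {
        char *(*rightsized)[FIELDS] = realloc(data, sizeof data[0] * records_i);
        if (rightsized != NULL) {
            data = rightsized;  // On failure the old, larger block is still valid.
        }
    }

    // ... use data, e.g. qsort(data, records_i, sizeof data[0], compare) ...

    // Free every line that getline() allocated, then let the existing
    // free(data) release the table itself. A partially read trailing
    // record, if any, would need the same per-line treatment.
    for (size_t r = 0; r < records_i; r++) {
        for (int f = 0; f < FIELDS; f++) {
            free(data[r][f]);
        }
    }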