Home > Blockchain >  Using qsort to sort a multidimensional array of variable-length strings in C
Using qsort to sort a multidimensional array of variable-length strings in C

Time:04-15

I have a piece of software that generates a rather large text file full of information about files in a directory. There are often several thousand files. Each one gets a set of informational entries, looking something like:

number
number
IMPORTANT NUMBER
info
info
info
info
info

These repeat. The text file will have the same eight lines for each file in the directory.

I'm supposed to sort this text file by IMPORTANT NUMBERs, the values that appear on the 3rd line, then the 3 8th line, then the 3 8*2 line, etc.

Currently, I'm reading them into a multidimensional character array, looking like:

[number][number][IMPORTANT NUMBER 1][info][info][info][info][info]
[number][number][IMPORTANT NUMBER 2][info][info][info][info][info]
[number][number][IMPORTANT NUMBER 3][info][info][info][info][info]
[number][number][IMPORTANT NUMBER 4][info][info][info][info][info]

etc.

The idea is to sort each set of 8 entries by the important numbers in ascending order. If, for example, my array looked like this:

[number2][number2][2][info2][info2][info2][info2][info2]
[number3][number3][3][info3][info3][info3][info3][info3]
[number1][number1][1][info1][info1][info1][info1][info1]
[number4][number4][4][info4][info4][info4][info4][info4]

After sorting, it would look like:

[number1][number1][1][info1][info1][info1][info1][info1]
[number2][number2][2][info2][info2][info2][info2][info2]
[number3][number3][3][info3][info3][info3][info3][info3]
[number4][number4][4][info4][info4][info4][info4][info4]

...with the value in arr[2] (1,2,3,4...) being used for the sorting.

The issue is that the information stored in the other columns tends to vary in size. arr[3] might have a length of 30 characters. arr[4] might have a length in excess of 5000. Doing this for a lot of data adds up quickly enough that I don't want to just allocate set sizes of whatever that max length might be, especially if I'm just going to use a tiny fraction of it at a time in most of the cases.

I've found a lot of good answers on using qsort, but very few on sorting large multidimensional string arrays. I'd also LIKE to use qsort, because I'd prefer not to reinvent the wheel and I doubt anything I'd write would be as efficient.

If anyone could shed some light on how this might be accomplished, I'd appreciate it.

Current code is:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define FIELDS 8

int compare(const void *row1, const void *row2);

int main(int argc, char *argv[])
{
    // (1) - Open File
    const char fname[] = "arrayFile.txt";

    FILE *fp = fopen(fname, "r");

    printf("Opened file: %s\n", fname); 

    // (2) - Count Lines
    char cr;
    size_t lines = 0;
    while (cr != EOF)
    {
        if (cr == '\n') 
        {
            lines  ;
        }
        cr = getc(fp);
    } 
    rewind(fp);

    // (3) - Populate Array
    char *data[lines / FIELDS][FIELDS];
    lines = lines / FIELDS;
    size_t n;

    for (int i = 0; i < lines; i  ) 
    {
        for (int j = 0; j < FIELDS; j  )
        {
            data[i][j] = NULL;
            size_t n = 0;
            getline(&data[i][j], &n, fp);
        }    
    }

    // (4) - Print Array Before
    for (int i = 0; i < lines; i  ) 
    {
        for (int j = 0; j < FIELDS; j  )
        {
            printf("%s", data[i][j]);
        }
    }

    printf("\n\nNot sorted\n\n");

    // (5) - Sort Array
    qsort(data, lines, sizeof(data[0]), compare);

    printf("\n\nsorted\n\n");

    // (6) - Print Array After
    for (int i = 0; i < lines; i  ) 
    {
        for (int j = 0; j < FIELDS; j  )
        {
            printf("%s", data[i][j]);
            free(data[i][j]);
        }
    }

    // Close File
    fclose(fp);

    printf("\n\nNumber of files: %ld\n", lines);
    printf("\n\nNumber of lines: %ld\n", lines * FIELDS);

    return 0;
}

int compare(const void *row1, const void *row2)
{
    const char *(*a)[8] = row1;
    const char *(*b)[8] = row2;

    return strcmp((*a)[2], (*b)[2]);
}

This, unfortunately (and predictably), produces a segmentation fault during sort time. I'd estimate it's due to how I'm handling the pointers and indexing, but the EXACT reason why is escaping me.

This seems like a really useful thing to know how to do well for the future, but it's a little more than I've personally tried to do with arrays and pointers in C before.

Thanks in advance.

CodePudding user response:

There are multiple problems in your code:

  • you should test for fopen potential failure to open the file.

  • char cr; should be int cr; to handle all 257 possible values returned by getc(), assuming 8-bit bytes.

  • cr is uninitilized during the first iteration of while (cr != EOF). You should write this loop as:

      int cr;
      while ((cr = getc(fp)) != EOF) {
          lines  = (cr == '\n');
      }
    
  • as documented by chux, an initial pass to read the whole file is unnecessary, you should instead reallocate the array as you read the file.

  • char *data[lines / FIELDS][FIELDS]; might define an array that is too large for automatic storage, causing a stack overflow

  • The correct format specifier for size_t is %zu, not %ld. size_t is not a long and might not even have the same size or argument passing convention.

  • the compare function uses too many indirections in the type conversion. While your type conversions are probably OK except for const correctness, they are difficult to grasp for most programmers. You should use a simpler approach:

    int compare(const void *row1, const void *row2) {
         char * const *a = row1;
         char * const *b = row2;
    
         return strcmp(a[2], b[2]);
    }
    
  • note however that the above function will sort the important numbers in lexicographical order, placing 11 between 1 and 2. You probably want numeric order instead:

    int compare(const void *row1, const void *row2) {
         char * const *a = row1;
         char * const *b = row2;
         long na = strtol(a, NULL, 10);
         long nb = strtol(b, NULL, 10);
         return (na > nb) - (na < nb);
    }
    

CodePudding user response:

Many concerns - I'll mention only one

Counting lines

Do not count lines. (Drop step 2.) Rather than make 2 passes, use 1 pass and instead adjust data as needed.

Some untested code to give OP an idea:

  char *(*data)[FIELDS] = NULL;
  size_t records_n = 0;  // Allocation total
  size_t records_i;      // Allocation used

  for (records_i = 0; records_i < SIZE_MAX; records_i  ) {
    if (records_i == records_n) {
      size_t records_new_n = records_n * 2   1;  // Double the allocation
      char *(*newdata)[FIELDS] = realloc(data, sizeof data[0] * records_new_n);
      if (newdata == NULL) {
        free(data);
        fprintf(stderr, "Out of memory.\n");
        exit(EXIT_FAILURE);
      }
      data = newdata;
      records_n = records_new_n;
    }
    int f;
    for (f = 0; f < FIELDS; f  ) {
      data[records_i][f] = NULL;
      size_t n = 0;
      if (getline(&data[records_i][f], &n, fp) == -1) {
        if (f == 0) {
          break;
        }
        fprintf(stderr, "Record ended early.\n");
        break; // Or maybe fail?
      }
      // Lop off potential '\n'
      if (n > 0 && data[records_i][f][n - 1] == '\n') {
        data[records_i][f][--n] = 0;
      }
    }
    if (f < FIELDS) {
      break;
    }
  }
  // Perhaps right-size data to records_i here?  Not shown.

  // ... Use data

  // When all done, free all lines allocated (not shown) and ...
  free(data);
  • Related