Can't get strcoll() to use locales when sorting in C

I have not been able to get locale-dependent functions such as strcoll() to work in C. I am wondering whether I am doing something wrong and/or how to get this to work. Here is a sample program from this book: Prinz, Peter, and Tony Crawford. 2016. C in a Nutshell, 2nd edn., p. 574. Beijing-Boston-Farnham-Sebastopol-Tokyo: O'Reilly. ISBN-13: 978-1-491-90475-6.

#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(void) {
   char *samples[ ] = { "curso", "churro" };
   setlocale(LC_COLLATE, "es_ES.UTF-8");
   int result = strcoll(samples[0], samples[1]);
   if(result == 0) {
      printf("The strings \"%s\" and \"%s\" are "
             "alphabetically equivalent.\n",
             samples[0], samples[1]);
   } else if(result < 0) {
      printf("The string \"%s\" comes before \"%s\" "
             "alphabetically.\n",
             samples[0], samples[1]);
   } else if(result > 0) {
      printf("The string \"%s\" comes after \"%s\" "
             "alphabetically.\n",
             samples[0], samples[1]);
   }
   return(0);
}

The book says that "curso" should come BEFORE "churro", because in Spanish "ch" is considered a separate letter for purposes of alphabetization. However, when I run this program it prints that "curso" comes AFTER "churro". I do not know Spanish, but I have tested this program with several other languages that I do know, and the result is always that of strcmp(), a strictly numerical comparison.

$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
$ locale -a | grep es_ES.utf8             
es_ES.utf8

I am aware of this question: "Getting locale functions to work in glibc". Its author says that locale-dependent functions such as strcoll perform poorly in glibc, and that he was writing his own modifications to it.

Am I missing something? Does this simply not work?

CodePudding user response：

Your book has outdated information. The Spanish digraph ch has not been considered a separate letter since 1994. See https://rae.es/dpd/abecedario:

"At the 10th Congress of the Association of Academies of the Spanish Language, held in 1994, it was agreed to adopt the universal Latin alphabetical order, in which ch and ll are not considered independent letters."

(Translated from the Spanish original.)

You can also look at the Unicode collation data for Spanish, which is the source glibc derives its collation data from. It defines several collation orders: the standard one does not treat ch and ll specially, while the traditional one does. Glibc implements the standard collation.

You can check that your Spanish locale collation is working by trying strings with accented characters. If the system is working, those sort in the order given by the collation rules (i.e. right after the corresponding non-accented character); if it is not (i.e. you forgot to call setlocale or the locale is not supported), they sort after all non-accented letters. Demo. Note that on godbolt, GCC does not support locales, while MSVC does (and with the Unix-like locale names to boot).

If you want to test multi-character collation, use the Czech locale (cs_CZ.UTF-8). It does recognise ch as a single letter, which comes after h in the collation order. Demo.

CodePudding user response：

"Am I missing something? Does this simply not work?"

I think it comes down to whether your environment recognizes "es_ES.UTF-8".

Note, I do not have access to a Linux environment, which waters down the ability to compare apples with apples. But I hope the following highlights a few things that might help...

On Windows, and using a standard LabWindows/CVI compiler (my version is based on Clang 3.3) it outputs the following:

"The string "curso" comes after "churro" alphabetically."

which appears to be incorrect according to your stated expectations when using the Spanish alphabetization rules.

I suspect implementation and version of libraries contribute to what we are seeing.

Note that when I later checked the return value of setlocale:

char *new = setlocale(LC_COLLATE, "es_ES.UTF-8");

It came back NULL, which, according to the setlocale documentation, means the following:

"If locale is non-NULL and can be honored, a pointer to the string associated with the specified category is returned. If the All Categories setting is selected, then the strings contain a concatenation of the locales for the different categories. If the selection cannot be honored, the function returns a NULL pointer and the program's locale remains unchanged."

So "es_ES.UTF-8" was not honored, and the locale was left unchanged.

This article has some interesting and related insights into using UTF-8 in C (and how it relates to the locale problems seen here).