R and Python have different character sort order

Time:01-24

I'm trying to sort a dataframe in R and discovered that the sort order does not match the ASCII order I expected. I need R to sort the dataframe the same way Python sorts the data.

df = df[do.call(order, df), ]  # sort by all columns

As shown here, Python sorts uppercase letters before lowercase letters, as ASCII order requires:

$ python
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53)
>>> "A" < "a"
True

But R sorts uppercase letters after lowercase letters:

$ R
R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)
  Natural language support but running in an English locale
> "A" < "a"
[1] FALSE
> "A" > "a"
[1] TRUE

How can I change R's sort behavior to match standard ASCII ordering? Is there a parameter to the order function, or a configuration setting, that changes the sort order?

Note: this is not a distinction between case-sensitive and case-insensitive sorting. It's worse than that: the case-sensitive sort itself uses a non-standard order.

Answer:

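Different locales use different sort orders, including the relative order of uppercase and lowercase letters. You probably want Sys.setlocale(locale = "C"), or Sys.setlocale("LC_COLLATE", "C") to change only the collation order. (There is more information about locale definitions and case sorting order here.)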

?Comparison says a little bit about locale-specific sorting ...

The collating sequence of locales such as ‘en_US’ is normally different from ‘C’ (which should use ASCII) and can be surprising.

... but as far as I can see, it does not say anything explicit about case order (searching for "case" in the page didn't get any hits).

> Sys.getlocale()
[1] "en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8"
> "A" < "a"
[1] FALSE
> Sys.setlocale(locale = "C")
[1] "C/C/C/C/C/en_CA.UTF-8"
> "A" < "a"
[1] TRUE
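
For reference, here is a minimal Python sketch of the order that the "C" locale is expected to reproduce, assuming ASCII-only strings. The sample list `words` is made up for illustration and does not come from the question. Python compares strings by code point, and collating through the "C" locale with locale.strxfrm should give the same result:

```python
import locale

# Hypothetical sample data, chosen only to mix upper and lower case.
words = ["b", "A", "a", "B"]

# Python compares str values by Unicode code point, so for ASCII text
# every uppercase letter (65-90) sorts before every lowercase letter (97-122).
by_code_point = sorted(words)

# Select the "C" collation locale explicitly and sort through it.
# For ASCII strings this is expected to match plain code point order.
locale.setlocale(locale.LC_COLLATE, "C")
by_c_locale = sorted(words, key=locale.strxfrm)

print(ord("A"), ord("a"))   # code points of "A" and "a"
print(by_code_point)
print(by_c_locale)
```

If the character columns of the R dataframe come out in this order after switching to the "C" locale, the two sorts should agree for ASCII data. Non-ASCII text may need a closer look.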