Home > Mobile >  order() in data.frame and data.table
order() in data.frame and data.table

Time:04-29

I get different results if I use order() in data.frame and data.table. For example:

A <- data.frame(one=c("k"),two=c("3_28","31_60","48_68"))
B <- as.data.table(A)

A[order(A$one,A$two),]
  one   two
1   k  3_28
2   k 31_60
3   k 48_68


B[order(B$one, B$two),]
   one   two
1:   k 31_60
2:   k  3_28
3:   k 48_68

I must admit this was a bit of a nasty shock, as I have assumed equivalent results for order() from data.frame and data.table for many years. I guess there is a lot of code I need to check!

Is there any way to ensure order() gives the same results in data.frame and data.table?

Many apologies if this difference in behavior is already well known, and is just an example of my ignorance.

CodePudding user response:

When used inside of a data.table operation, order(..) uses data.table:::forder. According to the Introduction to data.table vignette:

order() is internally optimised

  • We can use "-" on a character columns within the frame of a data.table to sort in decreasing order.

  • In addition, order(...) within the frame of a data.table uses data.table's internal fast radix order forder(). This sort provided such a compelling improvement over R's base::order that the R project adopted the data.table algorithm as its default sort in 2016 for R 3.3.0, see ?sort and the R Release NEWS.

The key to see the difference is that it uses a "fast radix order". If you see base::order, though, it has an argument method= which

  method: the method to be used: partial matches are allowed.  The
          default ('"auto"') implies '"radix"' for short numeric
          vectors, integer vectors, logical vectors and factors.
          Otherwise, it implies '"shell"'.  For details of methods
          '"shell"', '"quick"', and '"radix"', see the help for 'sort'.

Since the second column of your data.table is not one of numeric, integer, logical, or factor, then base::order uses the "shell" method for sorting, which produces different results.

However, if we force base::order to use method="radix", we get the same result.

order(A$two)
# [1] 1 2 3
order(A$two, method="radix")
# [1] 2 1 3

A[order(A$one, A$two, method = "radix"),]
#   one   two
# 2   k 31_60
# 1   k  3_28
# 3   k 48_68

You can affect the same ordering by using base::order:

B[base::order(B$one,B$two),]
#       one    two
#    <char> <char>
# 1:      k   3_28
# 2:      k  31_60
# 3:      k  48_68
  • Related