I have been wondering this for some time: purely in terms of memory and processing efficiency, what is the best variable type to store in a data frame column?
For example, I can store my variables as either strings or integers (as below). In this case, which of the columns would be more efficient for a 1-million-row dataset, and why?
string_col  int_col
code1       1
code2       2
code3       3
CodePudding user response:
A rough approximation (this may change when you put the values into a data frame, which is a different structure):
> object.size("code1")
112 bytes
> object.size(1)
56 bytes
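As an aside, a bare 1 is stored as a double. A single integer reports the same size because of allocation rounding, and the 4-byte vs 8-byte per-element difference only appears on longer vectors (a rough sketch on a 64-bit build; exact values can vary):
object.size(1L)             # a single integer: also 56 bytes (allocation rounding)
object.size(integer(1000))  # about 4 KB: 4 bytes per element plus header
object.size(numeric(1000))  # about 8 KB: 8 bytes per element plus header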
Or alternatively, measuring the columns of an example data frame df directly:
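(The original answer doesn't show how df was built; something like this sketch of the question's example is assumed:)
df <- data.frame(string_col = paste0("code", 1:3),
                 int_col = 1:3,
                 stringsAsFactors = FALSE)
df$string_col_fact <- factor(df$string_col)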
> object.size(df$string_col)
248 bytes
> object.size(df$int_col)
64 bytes
Adding the string column as a factor:
> object.size(df$string_col_fact)
648 bytes
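At only three rows the factor looks expensive because, besides its integer codes, it carries a levels attribute holding the strings and a class attribute, and that fixed overhead dominates at this size (a quick check):
f <- factor(paste0("code", 1:3))
typeof(f)        # "integer": a factor stores integer codes under the hood
attributes(f)    # plus a character vector of levels and the "factor" class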
Using a bigger set:
n <- 10^6
# Note: rep(x, n) repeats the whole 3-element vector n times, so each data
# frame below has 3 million rows; the relative sizes are what matter here.
sapply(list(
  str       = data.frame(rep(paste0("code", 1:3), n)),
  int       = data.frame(rep(1:3, n)),
  strFactor = data.frame(factor(rep(paste0("code", 1:3), n)))),
  object.size)
#      str       int strFactor
# 24000920  12000736  12001352
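Dividing by the number of elements (3 million in each data frame) gives a rough per-row cost from the sizes printed above:
sizes <- c(str = 24000920, int = 12000736, strFactor = 12001352)
sizes / (3 * 10^6)  # roughly 8 bytes per string pointer vs 4 bytes per integer or factor code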
CodePudding user response:
Under the hood, an R vector object is actually a symbol bound to a pointer (a VECSXP). The VECSXP points to the actual data-containing structure. The data we see in R as numeric vectors are stored as REALSXP objects. These contain header flags, some pointers (e.g. to attributes), a couple of integers giving information about the length of the vector, and finally the actual numbers: an array of double-precision floating point numbers.
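That layout is reflected in how object.size() grows with vector length (a rough check on a 64-bit build; exact byte counts can vary):
object.size(numeric(0))    # just the header: about 48 bytes
object.size(numeric(100))  # header plus 100 * 8 bytes: about 848 bytes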
For character vectors, the data have to be stored in a slightly more complicated way. The VECSXP points to a STRSXP, which again has header flags, some pointers and a couple of numbers to describe the length of the vector, but what then follows is not an array of characters, but an array of pointers to character strings (more precisely, an array of SEXPs pointing to CHARSXPs). A CHARSXP itself contains flags, pointers and length information, then an array of characters representing a string. Even for short strings, a CHARSXP will take up a minimum of about 56 bytes on a 64-bit system.
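That minimum shows up whenever a vector needs an extra distinct string (again, a quick check on a 64-bit system; exact numbers may vary slightly):
object.size(c("a", "a"))   # the two elements share a single CHARSXP
object.size(c("a", "b"))   # the second distinct string needs its own CHARSXP, roughly 56 bytes more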
The CHARSXP objects are re-used, so if you have a vector of 1 million strings each saying "code1", the array of pointers in the STRSXP should all point to the same CHARSXP. There is therefore only a very small memory overhead, of approximately 56 bytes, between a one-million-length vector of 1s and a one-million-length vector of "1"s.
a <- rep(1, 1e6)
object.size(a)
#> 8000048 bytes
b <- rep("1", 1e6)
object.size(b)
#> 8000104 bytes
This is not the case when you have many different strings, since each different string will require its own CHARSXP. For example, if we have 26 different strings within our 1-million-long vector rather than just a single string, we will take up an extra 56 * (26 - 1) = 1400 bytes of memory:
c <- rep(letters, length.out = 1e6)
object.size(c)
#> 8001504 bytes
So the short answer to your question is that as long as the number of unique elements is small, there is little difference in underlying memory usage. However, a character vector will always require more memory than a numeric vector, even if the difference is very small.