I have been wondering this for some time: purely in terms of memory and processing efficiency, what is the best variable type to store in a data frame column?
For example, I can store my variables as either strings or integers (as below). In this case, which of the columns would be more efficient for a 1-million-row dataset, and why?
string_col  int_col
code1       1
code2       2
code3       3
CodePudding user response:
A rough approximation (this may change when you put the values into a data frame, which is a different structure):
> object.size("code1")
112 bytes
> object.size(1)
56 bytes
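As an aside, a bare 1 is stored as a double. A single integer reports the same size because of allocation rounding, and the 4-byte vs 8-byte per-element difference only appears on longer vectors (a rough sketch on a 64-bit build; exact values can vary):
object.size(1L)             # a single integer: also 56 bytes (allocation rounding)
object.size(integer(1000))  # about 4 KB: 4 bytes per element plus header
object.size(numeric(1000))  # about 8 KB: 8 bytes per element plus header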
Or alternatively, measuring the columns of an example data frame df directly:
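(The original answer doesn't show how df was built; something like this sketch of the question's example is assumed:)
df <- data.frame(string_col = paste0("code", 1:3),
                 int_col = 1:3,
                 stringsAsFactors = FALSE)
df$string_col_fact <- factor(df$string_col)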
> object.size(df$string_col)
248 bytes
> object.size(df$int_col)
64 bytes
Adding the string column as a factor:
> object.size(df$string_col_fact)
648 bytes
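At only three rows the factor looks expensive because, besides its integer codes, it carries a levels attribute holding the strings and a class attribute, and that fixed overhead dominates at this size (a quick check):
f <- factor(paste0("code", 1:3))
typeof(f)        # "integer": a factor stores integer codes under the hood
attributes(f)    # plus a character vector of levels and the "factor" class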
Using a bigger set:
n <- 10^6
# Note: rep(x, n) repeats the whole 3-element vector n times, so each data
# frame below has 3 million rows; the relative sizes are what matter here.
sapply(list(
  str       = data.frame(rep(paste0("code", 1:3), n)),
  int       = data.frame(rep(1:3, n)),
  strFactor = data.frame(factor(rep(paste0("code", 1:3), n)))),
  object.size)
#      str       int strFactor
# 24000920  12000736  12001352
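Dividing by the number of elements (3 million in each data frame) gives a rough per-row cost from the sizes printed above:
sizes <- c(str = 24000920, int = 12000736, strFactor = 12001352)
sizes / (3 * 10^6)  # roughly 8 bytes per string pointer vs 4 bytes per integer or factor code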
CodePudding user response:
Under the hood, an R vector object is actually a symbol bound to a pointer (a VECSXP). The VECSXP points to the actual data-containing structure. The data we see in R as numeric vectors are stored as REALSXP objects. These contain header flags, some pointers (e.g. to attributes), a couple of integers giving information about the length of the vector, and finally the actual numbers: an array of double-precision floating point numbers.
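That layout is reflected in how object.size() grows with vector length (a rough check on a 64-bit build; exact byte counts can vary):
object.size(numeric(0))    # just the header: about 48 bytes
object.size(numeric(100))  # header plus 100 * 8 bytes: about 848 bytes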
For character vectors, the data have to be stored in a slightly more complicated way. The VECSXP points to a STRSXP, which again has header flags, some pointers and a couple of numbers to describe the length of the vector, but what then follows is not an array of characters, but an array of pointers to character strings (more precisely, an array of SEXPs pointing to CHARSXPs). A CHARSXP itself contains flags, pointers and length information, then an array of characters representing a string. Even for short strings, a CHARSXP will take up a minimum of about 56 bytes on a 64-bit system.
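That minimum shows up whenever a vector needs an extra distinct string (again, a quick check on a 64-bit system; exact numbers may vary slightly):
object.size(c("a", "a"))   # the two elements share a single CHARSXP
object.size(c("a", "b"))   # the second distinct string needs its own CHARSXP, roughly 56 bytes more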
The CHARSXP objects are re-used, so if you have a vector of 1 million strings each saying "code1", the array of pointers in the STRSXP should all point to the same CHARSXP. There is therefore only a very small memory overhead, of approximately 56 bytes, between a one-million-length vector of 1s and a one-million-length vector of "1"s.
a <- rep(1, 1e6)
object.size(a)
#> 8000048 bytes
b <- rep("1", 1e6)
object.size(b)
#> 8000104 bytes
This is not the case when you have many different strings, since each different string will require its own CHARSXP. For example, if we have 26 different strings within our 1-million-long vector rather than just a single string, we will take up an extra 56 * (26 - 1) = 1400 bytes of memory:
c <- rep(letters, length.out = 1e6)
object.size(c)
#> 8001504 bytes
So the short answer to your question is that as long as the number of unique elements is small, there is little difference in underlying memory usage. However, a character vector will always require more memory than a numeric vector, even if the difference is very small.