In R, do factors somehow save space?

If you have a .csv file in which most of the values for most variables are repeated, the file will still not be small, because CSV applies no compression. However, if the .csv file is read into R and the appropriate variables are coerced into factors, is there a compression benefit of some kind for the data frame or tibble? The repetition of factor values throughout a data frame or tibble seems like a great opportunity to save space, but I don't know whether this actually happens.

I tried searching for this online, but I didn't find answers. I'm not sure where to look for the way factors are implemented.

Answer:

The documentation you are looking for is at the ?factor help page:

factor returns an object of class "factor" which has a set of integer codes the length of x with a "levels" attribute of mode character and unique (!anyDuplicated(.)) entries.

So a factor is really just an integer vector plus a mapping (stored as the "levels" attribute) between each integer code and its label. That is nicely space efficient if you have repeats!
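To make that structure concrete, here is a minimal sketch that takes a small factor apart. The example values ("low", "medium", "high") are made up for illustration; only base R functions are used.

```r
# A small factor built from made-up labels (illustrative values only)
f <- factor(c("low", "high", "low", "medium", "high"))

typeof(f)      # "integer": the underlying storage is an integer vector
levels(f)      # "high" "low" "medium": unique labels, sorted by default
as.integer(f)  # 2 1 2 3 1: one integer code per element

# Indexing the levels by the codes reconstructs the original labels
labels_back <- levels(f)[as.integer(f)]
identical(labels_back, as.character(f))  # TRUE
```

Each label is stored once in the levels; every element of the vector only costs one integer code.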

However, later we see:

Note

In earlier versions of R, storing character data as a factor was more space efficient if there is even a small proportion of repeats. However, identical character strings now share storage, so the difference is small in most cases. (Integer values are stored in 4 bytes whereas each reference to a character string needs a pointer of 4 or 8 bytes.)

So, in older versions of R, factors could be much more space efficient, but newer versions share storage between identical character strings, so the difference is no longer as large.
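The shared storage described in the note can be observed indirectly. The sketch below compares a character vector of one repeated string with a vector of all-distinct strings of similar length. The variable names and label text are made up, and it relies on object.size() accounting for sharing within a single character vector, as its help page describes. The numbers it reports are estimates, so no exact sizes are given here.

```r
n <- 1e5

# One string repeated n times: every element points at the same stored string
repeated <- rep("a fairly long repeated label", n)

# n different strings: each element needs its own stored string
distinct <- sprintf("label number %06d", seq_len(n))

size_repeated <- as.numeric(object.size(repeated))
size_distinct <- as.numeric(object.size(distinct))

# The repeated vector costs roughly one pointer per element plus one string;
# the distinct vector also pays for n separate strings.
size_repeated < size_distinct  # TRUE
```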

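We can see the current difference. The sizes below are consistent with 8-byte string pointers versus 4-byte integer codes, so they come out at roughly two to one: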

n <- 1e6
char <- sample(letters, size = n, replace = TRUE)  # 1e6 single-letter strings
fact <- factor(char)                               # same data stored as a factor

object.size(char)
# 8001504 bytes
object.size(fact)
# 4002096 bytes
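Since the question was about data frames, here is a sketch of the same comparison on a one-column data frame. The column name `group` and the three labels are hypothetical, and the result assumes a 64-bit build of R, where a string pointer takes 8 bytes; on a build with 4-byte pointers the two sizes would be expected to be much closer.

```r
n <- 1e5

# Made-up categorical data with only three distinct values
labels <- sample(c("control", "treatment_a", "treatment_b"),
                 size = n, replace = TRUE)

# The same column stored as character and as factor
df_char <- data.frame(group = labels, stringsAsFactors = FALSE)
df_fact <- data.frame(group = labels, stringsAsFactors = TRUE)

size_char <- as.numeric(object.size(df_char))
size_fact <- as.numeric(object.size(df_fact))

# On a 64-bit build the factor column is expected to take about half the space
size_fact / size_char
```

The saving comes from the element width (integer code versus pointer), not from the length of the labels, because the label strings are stored once in either representation.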