Apologies for the rather rudimentary questions, but I haven't been able to find answers easily, and I'd also like some solid confirmation.
I have a data frame which contains numeric, factor and ordered factor variables, and when I converted it to a matrix using as.matrix, I noticed that the elements of the matrix were all characters. From this experience, I have two questions:
First, am I right in saying that vectors and matrices can only contain one data type, and this is why coercion occurs?
Secondly, and more importantly, what combinations of data types in a data frame lead to character matrices vs. numeric matrices etc.? E.g., if I had just logical, integer and numeric types in my df, I imagine I would get a numeric matrix; is this correct? So is it just the inclusion of factors, ordered factors and/or characters in my data frame that, when it is converted into a matrix, brings about the coercion of every element into a character?
Thanks so much for reading, any help is appreciated :]
CodePudding user response:
Answer to your first question: yes and no.
Actually, a matrix is a vector with a dim attribute.
And a vector must usually have one data type only. A list is an exception: it's a vector with list mode, and a list may also have a dim attribute.
For instance:
> is.vector(list(1, "a", T))
[1] TRUE
> mode(list(1, "a", T))
[1] "list"
> a <- structure(list(1, "a", T, 1+2i), dim = c(2, 2))
> is.matrix(a)
[1] TRUE
> a
     [,1] [,2]
[1,] 1    TRUE
[2,] "a"  1+2i
But it's still probably the reason as.matrix is doing coercion: it's much easier to convert everything to a single type and deal with a matrix with elements of a single type.
However, it's a choice made by as.matrix, and it would be possible, though I think unadvisable, to convert a data.frame to a list-matrix, while keeping all data types intact.
It would be inefficient: atomic vectors are stored in contiguous memory locations, which means 1/ no memory is wasted storing a data type for each element, 2/ vectorized code runs faster, and 3/ external C or Fortran code, which expects contiguous data of a single type, can work on them directly, whereas dealing with lists there would be cumbersome and useless. I have never seen a list-matrix actually used, though I guess it might help in some circumstances.
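To make the list-matrix idea concrete, here is a minimal sketch of how such a conversion could be done by hand. The helper name as_list_matrix and the example data frame are made up for illustration; this is not something as.matrix offers, just one possible way to do it.

```r
# Hypothetical helper: turn a data frame into a list-matrix, with one
# list element per cell, so each cell keeps its original type.
as_list_matrix <- function(df) {
  # as.list() splits each column into length-1 elements; c() then
  # concatenates the columns in order, which matches the column-major
  # layout of a matrix.
  cells <- do.call(c, unname(lapply(df, as.list)))
  structure(cells, dim = dim(df), dimnames = list(NULL, names(df)))
}

# Hypothetical example data
df <- data.frame(n = 1:2, f = factor(c("a", "b")), x = c(1.5, 2.5))
lmat <- as_list_matrix(df)

is.matrix(lmat)      # TRUE
typeof(lmat)         # "list"
class(lmat[[1, 2]])  # "factor"
```

Every cell is now a separate R object, which is exactly the overhead described above.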
The answer to your second question is in the documentation of as.matrix:
as.matrix is a generic function. The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns. Otherwise, the usual coercion hierarchy (logical < integer < double < complex) will be used, e.g., all-logical data frames will be coerced to a logical matrix, mixed logical-integer will give a integer matrix, etc.
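A quick way to check these rules is to try them on small data frames. The data frames below are hypothetical examples, not the ones from the question:

```r
# Hypothetical example data frames covering the cases in the quote
df_lgl     <- data.frame(a = c(TRUE, FALSE), b = c(FALSE, TRUE))
df_lgl_int <- data.frame(a = c(TRUE, FALSE), n = 1:2)
df_num     <- data.frame(a = c(TRUE, FALSE), n = 1:2, x = c(1.5, 2.5))
df_fac     <- data.frame(n = 1:2, f = factor(c("u", "v")))

m_lgl     <- as.matrix(df_lgl)
m_lgl_int <- as.matrix(df_lgl_int)
m_num     <- as.matrix(df_num)
m_fac     <- as.matrix(df_fac)

typeof(m_lgl)      # "logical"
typeof(m_lgl_int)  # "integer"
typeof(m_num)      # "double"
typeof(m_fac)      # "character": the factor column forces everything to character
```

So logical, integer and numeric columns together give a numeric matrix, and a single factor (or character) column is enough to get a character matrix.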
You may also have a look at the source code of as.matrix.data.frame.