(R) Question regarding type coercion when converting a data frame to a matrix in R

Time:03-17

Apologies for the rather rudimentary questions, but I haven't been able to find answers easily, and I would also like some solid confirmation.

I have a data frame that contains numeric, factor and ordered factor variables. When I converted it to a matrix using as.matrix, I noticed that the elements of the matrix were all characters. This raises two questions:

First, am I right in saying that vectors and matrices can only contain one data type, and this is why coercion occurs?

Second, and more importantly, which combinations of data types in a data frame lead to character matrices versus numeric matrices, and so on? For example, if I had only logical, integer and numeric types in my df, I imagine I would get a numeric matrix; is this correct? So is it just the inclusion of factors, ordered factors and/or characters in my data frame that, on conversion to a matrix, coerces every element to character?

Thanks so much for reading, any help is appreciated :]

Answer:

Answer to your first question: yes and no.

Actually, a matrix is a vector with a dim attribute.

And a vector usually holds a single data type. A list is the exception: it is a vector of mode "list", and a list may also have a dim attribute.

For instance:

> is.vector(list(1, "a", T))
[1] TRUE

> mode(list(1, "a", T))
[1] "list"

> a <- structure(list(1, "a", T, 1+2i), dim = c(2, 2))
> is.matrix(a)
[1] TRUE

> a
     [,1] [,2]
[1,] 1    TRUE
[2,] "a"  1+2i
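
For contrast, here is a minimal sketch of the atomic (non-list) case, using made-up values: combining mixed types with c() forces a single type, and setting dim on the result gives an ordinary matrix.

```r
# Atomic vectors hold a single type, so c() coerces mixed inputs.
x <- c(1, "a", TRUE)
typeof(x)            # "character"

# Without a character element, logical and integer are promoted to double.
y <- c(TRUE, 2L, 3.5, 4)
typeof(y)            # "double"

# A matrix is that same vector plus a dim attribute.
dim(y) <- c(2, 2)
is.matrix(y)         # TRUE
```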

Still, the single-type rule is probably why as.matrix coerces: it is much easier to convert everything to one type and work with a matrix whose elements all share that type.

However, that is a choice made by as.matrix. It would be possible, though I think inadvisable, to convert a data.frame to a list-matrix and keep all data types intact.

It would be inefficient. Atomic vectors are stored in contiguous memory, which means: 1/ no memory is wasted storing a type for each element; 2/ vectorized code runs faster; and 3/ external C or Fortran code, which expects contiguous data of a single type, can use them directly, whereas dealing with lists there would be cumbersome and pointless. I have never seen a list-matrix actually used, though I guess it might help in some circumstances.
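
For illustration only, a list-matrix could be built from a data frame roughly as follows. The data frame df and its columns are made up, and this is just one possible approach rather than anything as.matrix provides.

```r
# Hypothetical data frame with three different column types.
df <- data.frame(n = c(1.5, 2.5), s = c("a", "b"), l = c(TRUE, FALSE),
                 stringsAsFactors = FALSE)

# Split every column into a list of single cells, then flatten one level.
# The cells come out in column-major order, which is what dim expects.
cells <- unlist(lapply(df, as.list), recursive = FALSE, use.names = FALSE)

# Attach dim (and column names) to get a matrix of mode "list".
cell_mat <- structure(cells, dim = dim(df),
                      dimnames = list(NULL, names(df)))

is.matrix(cell_mat)          # TRUE
typeof(cell_mat)             # "list"
class(cell_mat[[1, "n"]])    # "numeric"
class(cell_mat[[2, "s"]])    # "character"
```

Each cell keeps its own type, but every access goes through [[ ]], which is part of why this layout is awkward in practice.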


The answer to your second question is in the documentation of as.matrix:

as.matrix is a generic function. The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns. Otherwise, the usual coercion hierarchy (logical < integer < double < complex) will be used, e.g., all-logical data frames will be coerced to a logical matrix, mixed logical-integer will give a integer matrix, etc.
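
The rules quoted above can be checked with a small sketch. The data frames below are invented for the purpose, and typeof is used to inspect the storage type of each resulting matrix.

```r
# Logical, integer and double columns: the hierarchy ends at double.
df_num <- data.frame(l = c(TRUE, FALSE), i = 1:2, d = c(1.5, 2.5))
typeof(as.matrix(df_num))                  # "double"

# Logical and integer only: an integer matrix.
typeof(as.matrix(df_num[c("l", "i")]))     # "integer"

# Logical only: a logical matrix.
typeof(as.matrix(df_num["l"]))             # "logical"

# Adding a factor (ordered or not) turns every cell into a string.
df_mix <- data.frame(
  d = c(1.5, 2.5),
  f = factor(c("lo", "hi")),
  o = factor(c("lo", "hi"), levels = c("lo", "hi"), ordered = TRUE)
)
m_mix <- as.matrix(df_mix)
typeof(m_mix)                              # "character"
```

So for the data frame in the question, the factor and ordered factor columns are what push the whole matrix to character.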

You may also have a look at the source code of as.matrix.data.frame; typing as.matrix.data.frame at the R prompt prints it.
