How to automatically figure out value types in a data.table and convert them accordingly in R?-CodePudding

Lets say, I loaded a dataset from somewhere and all columns in it are characters, like below

>diamonds2
        carat       cut  color  clarity  depth  table  price      x      y      z
       <char>    <char> <char>   <char> <char> <char> <char> <char> <char> <char>
    1:   0.23     Ideal      E SPECIAL2   61.5     55    326   3.95   3.98   2.43
    2:   0.21   Premium      E SPECIAL1   59.8     61    326   3.89   3.84   2.31
    3:   0.23      Good      E      VS1   56.9     65    327   4.05   4.07   2.31

But in fact some of them are not characters (could be Date, Time, numeric, integer).

What's the smartest / fastest way to automatically convert all variables in such dataset into their appropriate types (e.g. 'carat' should be numeric, 'table' should be integer and so on, and if there is a variable that looks like "2000-12-31" it will be automatically converted to Date - just like in PowerBI, for example )?

I know the hack - to save data locally as .csv and then open it using fread(). But how to convert it without saving

My file is quite big, so I need to do it in memory efficiently (i.e. using data.table)

CodePudding user response：

The type_convert from readr can pick up the Date as well. With base R, we can use type.convert, but it may not do the Date conversion

df1 <- readr::type_convert(df)

-checking

> str(df)
'data.frame':   10 obs. of  3 variables:
 $ x   : chr  "0.933409547200426" "0.818039098987356" "0.899821242550388" "0.475551090203226" ...
 $ y   : chr  "8" "5" "2" "9" ...
 $ date: chr  "2022-10-19" "2022-10-20" "2022-10-21" "2022-10-22" ...

> str(df1)
'data.frame':   10 obs. of  3 variables:
 $ x   : num  0.933 0.818 0.9 0.476 0.19 ...
 $ y   : num  8 5 2 9 10 3 7 1 6 4
 $ date: Date, format: "2022-10-19" "2022-10-20" "2022-10-21" "2022-10-22" ...

data

df <- structure(list(x = c("0.933409547200426", "0.818039098987356", 
"0.899821242550388", "0.475551090203226", "0.190425756154582", 
"0.740122902207077", "0.760023688431829", "0.97861848748289", 
"0.586465073982254", "0.724658250808716"), y = c("8", "5", "2", 
"9", "10", "3", "7", "1", "6", "4"), date = c("2022-10-19", "2022-10-20", 
"2022-10-21", "2022-10-22", "2022-10-23", "2022-10-24", "2022-10-25", 
"2022-10-26", "2022-10-27", "2022-10-28")), 
class = "data.frame", row.names = c(NA, 
-10L))

CodePudding user response：

I'd still use write/read with correct classes approach, no need to write the full dataset, try:

#example input, all cols are character
diamonds2 <- ggplot2::diamonds
diamonds2[] <- lapply(diamonds2, as.character)
#check
table(sapply(diamonds2, class))
# character 
#        10

library(data.table)
#write the head and read to get classes
f <- tempfile()
fwrite(head(diamonds2), file = f)
x <- fread(f)

Then apply classes using set:

cc <- sapply(x, class)
setDT(diamonds2)

for(col in names(cc)) set(diamonds2, j = col, value = as(diamonds2[[ col ]], cc[ col ]))

#check
table(sapply(diamonds2, class))
# character   integer   numeric 
#         3         2         5