Lets say, I loaded a dataset from somewhere and all columns in it are characters, like below
>diamonds2
carat cut color clarity depth table price x y z
<char> <char> <char> <char> <char> <char> <char> <char> <char> <char>
1: 0.23 Ideal E SPECIAL2 61.5 55 326 3.95 3.98 2.43
2: 0.21 Premium E SPECIAL1 59.8 61 326 3.89 3.84 2.31
3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
But in fact some of them are not characters (could be Date, Time, numeric, integer).
What's the smartest / fastest way to automatically convert all variables in such dataset into their appropriate types (e.g. 'carat' should be numeric, 'table' should be integer and so on, and if there is a variable that looks like "2000-12-31" it will be automatically converted to Date - just like in PowerBI, for example )?
I know the hack - to save data locally as .csv and then open it using fread()
. But how to convert it without saving
My file is quite big, so I need to do it in memory efficiently (i.e. using data.table)
CodePudding user response:
The type_convert
from readr
can pick up the Date as well. With base R
, we can use type.convert
, but it may not do the Date conversion
df1 <- readr::type_convert(df)
-checking
> str(df)
'data.frame': 10 obs. of 3 variables:
$ x : chr "0.933409547200426" "0.818039098987356" "0.899821242550388" "0.475551090203226" ...
$ y : chr "8" "5" "2" "9" ...
$ date: chr "2022-10-19" "2022-10-20" "2022-10-21" "2022-10-22" ...
> str(df1)
'data.frame': 10 obs. of 3 variables:
$ x : num 0.933 0.818 0.9 0.476 0.19 ...
$ y : num 8 5 2 9 10 3 7 1 6 4
$ date: Date, format: "2022-10-19" "2022-10-20" "2022-10-21" "2022-10-22" ...
data
df <- structure(list(x = c("0.933409547200426", "0.818039098987356",
"0.899821242550388", "0.475551090203226", "0.190425756154582",
"0.740122902207077", "0.760023688431829", "0.97861848748289",
"0.586465073982254", "0.724658250808716"), y = c("8", "5", "2",
"9", "10", "3", "7", "1", "6", "4"), date = c("2022-10-19", "2022-10-20",
"2022-10-21", "2022-10-22", "2022-10-23", "2022-10-24", "2022-10-25",
"2022-10-26", "2022-10-27", "2022-10-28")),
class = "data.frame", row.names = c(NA,
-10L))
CodePudding user response:
I'd still use write/read with correct classes approach, no need to write the full dataset, try:
#example input, all cols are character
diamonds2 <- ggplot2::diamonds
diamonds2[] <- lapply(diamonds2, as.character)
#check
table(sapply(diamonds2, class))
# character
# 10
library(data.table)
#write the head and read to get classes
f <- tempfile()
fwrite(head(diamonds2), file = f)
x <- fread(f)
Then apply classes using set:
cc <- sapply(x, class)
setDT(diamonds2)
for(col in names(cc)) set(diamonds2, j = col, value = as(diamonds2[[ col ]], cc[ col ]))
#check
table(sapply(diamonds2, class))
# character integer numeric
# 3 2 5
Related posts: