Efficient way to convert <dattm.dt> to date in R

Time:09-28

Here is my issue: I am using reticulate to pull Python data frames from a database into R. One of the variables is a date. When the data frame is converted from Python to R, the date column becomes a list column and every entry prints as <dattm.dt> (see the dput output below). So far I have been working around the problem as follows:

library(tidyverse)
library(reticulate)
library(lubridate)

date_strings <- x %>% pull(date_object)  ## Retrieve the date list
## Convert each Python date object individually, then restore the Date class
fixed_dates <- sapply(seq_along(date_strings), function(j) {
  py_to_r(date_strings[[j]])
}) %>% as_date()

##Dput below
structure(list(date_object = list(<environment>, <environment>, 
    <environment>, <environment>, <environment>, <environment>, 
    <environment>, <environment>, <environment>, <environment>, 
    <environment>, <environment>, <environment>, <environment>, 
    <environment>, <environment>, <environment>, <environment>, 
    <environment>, <environment>), metric = c(0.216754862863576, 
-0.542492572263425, 0.891144645072327, 0.595980577187475, 1.63561800111297, 
0.689275441919723, -1.28124663010116, -0.213144519278363, 1.89653987190927, 
1.77686321368272, 0.566604498180317, 0.01571945400457, 0.383057338517151, 
-0.0451371159133086, 0.0343519073969926, 0.169026774218306, 1.16502683902767, 
-0.0442039972520874, -0.100368442585905, -0.283444568873591)), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"), pandas.index = <environment>)

Here are the top elements of the date_strings object:

[[1]]
<environment: 0x7f904dc4d5b8>
attr(,"class")
[1] "datetime.date"         "python.builtin.object"

[[2]]
<environment: 0x7f904dc4d430>
attr(,"class")
[1] "datetime.date"         "python.builtin.object"

[[3]]
<environment: 0x7f904dc4d318>
attr(,"class")
[1] "datetime.date"         "python.builtin.object"

While this approach works well for small datasets, it takes a really long time when the data frame is big (think thousands of rows). Is there a way to optimize the process or to vectorize it?

Answer:

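We may use lapply instead of sapply and combine the results into a single vector with do.call(c, ...). The reason is that sapply simplifies its result to a plain numeric vector and drops the Date class, whereas c, when the evaluated elements are of class Date, keeps the class and does not coerce them to integer mode.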

do.call(c, lapply(seq_along(date_strings), 
        function(j) py_to_r(date_strings[[j]])))
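
To see why the choice between sapply and do.call(c, lapply(...)) matters, here is a minimal base R sketch that does not need Python at all. It assumes py_to_r() has already returned R Date objects; the list `dates` and its values are made up for illustration and stand in for the converted elements.

```r
# Hypothetical stand-in for the converted elements: a plain list of Date objects
dates <- list(as.Date("2021-09-26"), as.Date("2021-09-27"), as.Date("2021-09-28"))

# sapply() simplifies the list and drops the Date class along the way
via_sapply <- sapply(dates, identity)

# do.call(c, ...) dispatches to the Date method of c(), so the class is kept
via_c <- do.call(c, lapply(dates, identity))

class(via_sapply)  # "numeric" (days since 1970-01-01)
class(via_c)       # "Date"

# The numeric result can still be turned back into dates explicitly
recovered <- as.Date(via_sapply, origin = "1970-01-01")
all(recovered == via_c)  # TRUE
```

This only illustrates the class handling. Both versions still call the conversion once per element, so any speed-up on a large data frame is something to measure rather than assume.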