I have a list of 106 tibbles, which are time series, in data_sensor. Each tibble has two columns, date and temperature. I also have a list of 106 dates in date_admin, one per tibble, giving the date at which I want that tibble's time series to end.
The code below works properly, but with the nested for-loop it takes too much time, as the average number of rows is close to 10,000 per tibble.
library(tidyverse)
library(readxl) # read_xls() and read_xlsx(); dplyr is already loaded by tidyverse
# List nesting all the dataframes of all the xls files
files <- dir("C:/User*inals", pattern = "\\.xls$", full.names = TRUE)
data_sensor <- lapply(files, read_xls)
# List nesting all the dataframes of all the xlsx files
filesx <- dir("C:/Us******ls", pattern = "\\.xlsx$", full.names = TRUE)
data_generic <- lapply(filesx, read_xlsx)
idxend = vector()
for (i in seq_along(data_sensor)) {
  for (j in seq_along(data_sensor[[i]][[1]])) {
    if (as.Date(data_sensor[[i]][[1]][[j]]) < as.Date(date_admin[i])) {
      data_sensor[[i]][[1]][[j]] = data_sensor[[i]][[1]][[j]]
    } else { # Convert all the elements after the condition to NA's
      data_sensor[[i]][[1]][[j]] = NA
      data_sensor[[i]][[2]][[j]] = NA
    }
  }
  # Drop all NA's
  for (i in seq_along(data_sensor)) {
    data_sensor[[i]] = drop_na(data_sensor[[i]])
  }
}
To clarify the structure of my list of tibbles and the date vector:
> data_sensor[[1]][[1]][[1]]
[1] "2018-08-07 11:00:31 UTC"
> data_sensor[[1]][[2]][[1]]
[1] 6.3
> data_sensor[[2]][[1]][[1]]
[1] "2018-08-08 11:56:05 UTC"
# data_sensor[[index of list]][[column of tibble (date, Temperature)]][[row of tibble]]
> date_admin
[1] "2018-10-07 UTC" "2018-12-29 UTC" "2018-12-13 UTC" "2019-08-09 UTC" "2019-10-10 UTC"
[6] "2019-04-26 UTC" "2018-11-21 UTC" "2018-08-23 UTC" "2019-07-08 UTC" "2019-11-19 UTC"
[11] "2019-11-07 UTC" "2018-09-05 UTC" "2018-09-03 UTC" "2018-09-24 UTC" "2018-10-11 UTC"
[16] "2018-09-25 UTC" "2019-03-29 UTC" "2018-08-20 UTC" "2018-09-17 UTC" "2019-03-30 UTC"
[21] "2018-11-07 UTC" "2019-01-01 UTC" "2018-08-31 UTC" "2019-03-27 UTC" "2019-11-10 UTC"
[26] "2019-04-04 UTC" "2019-10-18 UTC" "2018-09-06 UTC" "2018-09-23 UTC" "2018-09-22 UTC"
[31] "2019-07-22 UTC" "2018-09-04 UTC" "2019-05-17 UTC" "2018-11-05 UTC" "2018-12-09 UTC"
[36] "2018-09-03 UTC" "2019-05-21 UTC" "2019-02-22 UTC" "2018-08-30 UTC" "2019-06-04 UTC"
[41] "2018-09-13 UTC" "2018-10-14 UTC" "2019-11-08 UTC" "2018-08-30 UTC" "2019-04-12 UTC"
[46] "2018-09-24 UTC" "2018-08-22 UTC" "2018-08-30 UTC" "2018-09-07 UTC" "2018-11-11 UTC"
[51] "2018-11-01 UTC" "2018-10-01 UTC" "2018-10-22 UTC" "2018-12-03 UTC" "2019-06-06 UTC"
[56] "2018-09-09 UTC" "2018-09-10 UTC" "2018-09-24 UTC" "2018-10-11 UTC" "2018-11-30 UTC"
[61] "2018-09-20 UTC" "2019-11-20 UTC" "2018-10-11 UTC" "2018-10-09 UTC" "2018-09-27 UTC"
[66] "2019-11-11 UTC" "2018-10-04 UTC" "2018-09-14 UTC" "2019-04-27 UTC" "2018-09-04 UTC"
[71] "2018-09-11 UTC" "2018-08-14 UTC" "2018-09-01 UTC" "2018-10-01 UTC" "2018-09-25 UTC"
[76] "2018-09-28 UTC" "2018-09-29 UTC" "2018-10-11 UTC" "2019-03-26 UTC" "2018-10-26 UTC"
[81] "2018-11-21 UTC" "2018-12-02 UTC" "2018-09-08 UTC" "2019-01-08 UTC" "2018-11-07 UTC"
[86] "2019-02-05 UTC" "2019-01-21 UTC" "2018-09-11 UTC" "2018-12-17 UTC" "2019-01-15 UTC"
[91] "2018-08-28 UTC" "2019-01-08 UTC" "2019-05-14 UTC" "2019-01-21 UTC" "2018-11-12 UTC"
[96] "2018-10-26 UTC" "2019-12-26 UTC" "2020-01-03 UTC" "2020-01-06 UTC" "2020-02-26 UTC"
[101] "2020-02-14 UTC" "2020-01-27 UTC" "2020-01-21 UTC" "2020-03-16 UTC" "2020-02-26 UTC"
[106] "2019-12-31 UTC"
data_sensor[[1]]
date Temperature
1 2018-08-07 11:00:31 6.3
2 2018-08-07 11:10:31 11.4
3 2018-08-07 11:20:31 12.0
4 2018-08-07 11:30:31 13.7
5 2018-08-07 11:40:31 15.6
6 2018-08-07 11:50:31 13.6
7 2018-08-07 12:00:31 12.2
8 2018-08-07 12:10:31 11.2
9 2018-08-07 12:20:31 11.6
...............................
...............................
...............................
499 2018-08-10 22:00:31 9.7
500 2018-08-10 22:10:31 9.6
[ reached 'max' / getOption("max.print") -- omitted 8592 rows ]
It takes a few minutes to clean the data with these nested for-loops. How can I improve the performance of my code?
CodePudding user response:
If you convert the whole thing into a vectorized subsetting operation, it should be an order of magnitude quicker, e.g.:
for (i in seq_along(data_sensor)) {
  data_sensor[[i]] <- data_sensor[[i]][as.Date(data_sensor[[i]]$date) < as.Date(date_admin[i]), ]
}
For loops in R are generally somewhat slow; it is better to avoid nested loops and to use vectorized operations where possible.
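The same subsetting can also be written without an explicit loop via purrr::map2; a minimal sketch, equally untested, assuming the date column is named date as in your printout:
# map2() is in purrr, which library(tidyverse) already attaches
# Keep only the rows of each tibble that fall before its cut-off date
data_sensor <- map2(data_sensor, date_admin, function(df, cutoff) {
  df[as.Date(df$date) < as.Date(cutoff), ]
})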
P.S. For lack of data I could not test this.
CodePudding user response:
Absolutely don't do it in a loop! There are much more efficient methods for such operations, and I'll show you one. But first I need to generate some data. For this, I created two little functions: rndDate randomizes a start date between 2018-01-01 and 2020-12-31, while fDateSensor returns a tibble with a time series sampled every 10 minutes.
library(tidyverse) # tibble(), map(), %>%
library(lubridate) # ymd(), dminutes(), as_datetime()

# Draw a random start date between 2018-01-01 and 2020-12-31
rndDate = function(start_date = ymd("20180101"), end_date = ymd("20201231")) {
  sample(seq(start_date, end_date, "days"), 1)
}

# A tibble of n random temperatures sampled every 10 minutes from a random start
fDateSensor = function(n) tibble(
  date = rndDate() + 1:n * dminutes(10),
  Temperature = rnorm(n)
)
fDateSensor(5)
output
# A tibble: 5 x 2
date Temperature
<dttm> <dbl>
1 2019-09-27 00:10:00 -0.511
2 2019-09-27 00:20:00 0.531
3 2019-09-27 00:30:00 1.42
4 2019-09-27 00:40:00 0.252
5 2019-09-27 00:50:00 -0.570
Now I'm going to make a tibble with nested tibbles. First, for just two dates in date_admin:
nDateSensor = 10
set.seed(1234)
date_admin = c("2018-10-07", "2019-07-29")
data_sensor =
  tibble(
    file = paste("file", 1:length(date_admin)),
    date_admin = date_admin
  ) %>%
  mutate(data_sensor = map(file, ~fDateSensor(nDateSensor)))
data_sensor
output
# A tibble: 2 x 3
file date_admin data_sensor
<chr> <chr> <list>
1 file 1 2018-10-07 <tibble [10 x 2]>
2 file 2 2019-07-29 <tibble [10 x 2]>
As you can see, I simulated reading two files. Their contents are in the column data_sensor, each element of which is a tibble of size 10x2.
data_sensor$data_sensor
[[1]]
# A tibble: 10 x 2
date Temperature
<dttm> <dbl>
1 2020-10-14 00:10:00 0.314
2 2020-10-14 00:20:00 0.359
3 2020-10-14 00:30:00 -0.730
4 2020-10-14 00:40:00 0.0357
5 2020-10-14 00:50:00 0.113
6 2020-10-14 01:00:00 1.43
7 2020-10-14 01:10:00 0.983
8 2020-10-14 01:20:00 -0.622
9 2020-10-14 01:30:00 -0.732
10 2020-10-14 01:40:00 -0.517
[[2]]
# A tibble: 10 x 2
date Temperature
<dttm> <dbl>
1 2019-07-28 00:10:00 -0.776
2 2019-07-28 00:20:00 0.0645
3 2019-07-28 00:30:00 0.959
4 2019-07-28 00:40:00 -0.110
5 2019-07-28 00:50:00 -0.511
6 2019-07-28 01:00:00 -0.911
7 2019-07-28 01:10:00 -0.837
8 2019-07-28 01:20:00 2.42
9 2019-07-28 01:30:00 0.134
10 2019-07-28 01:40:00 -0.491
Now for the most important moment: we will build a function f to modify our internal tibbles according to your expectations.
f = function(data) {
  data$data_sensor[[1]] = data$data_sensor[[1]] %>% mutate(
    # ifelse() drops the datetime class, hence the as_datetime() round-trip
    date = ifelse(date < data$date_admin, NA, date) %>% as_datetime(),
    Temperature = ifelse(date < data$date_admin, NA, Temperature)
  )
  # Additionally report how many observations were turned into NA
  data %>% mutate(nNA = sum(is.na(data$data_sensor[[1]]$date)))
}
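As a side note, dplyr::if_else preserves the datetime class, so the as_datetime() round-trip can be avoided; a minimal sketch of that variant (a hypothetical f2, not the function used for the outputs below):
f2 = function(data) {
  cutoff = as_datetime(data$date_admin) # parse the cut-off once, not per row
  data$data_sensor[[1]] = data$data_sensor[[1]] %>% mutate(
    # NA the temperature first, while date is still intact
    Temperature = if_else(date < cutoff, NA_real_, Temperature),
    date = if_else(date < cutoff, as_datetime(NA), date)
  )
  data %>% mutate(nNA = sum(is.na(data$data_sensor[[1]]$date)))
}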
data_sensor = data_sensor %>%
group_by(file) %>%
group_modify(~f(.x))
data_sensor$data_sensor
output
[[1]]
# A tibble: 10 x 2
date Temperature
<dttm> <dbl>
1 2020-10-14 00:10:00 0.314
2 2020-10-14 00:20:00 0.359
3 2020-10-14 00:30:00 -0.730
4 2020-10-14 00:40:00 0.0357
5 2020-10-14 00:50:00 0.113
6 2020-10-14 01:00:00 1.43
7 2020-10-14 01:10:00 0.983
8 2020-10-14 01:20:00 -0.622
9 2020-10-14 01:30:00 -0.732
10 2020-10-14 01:40:00 -0.517
[[2]]
# A tibble: 10 x 2
date Temperature
<dttm> <lgl>
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 NA NA
7 NA NA
8 NA NA
9 NA NA
10 NA NA
As you can see, everything works great. Additionally, apart from mutating data_sensor, our function f returns the number of NA observations in the column nNA.
# A tibble: 2 x 4
# Groups: file [2]
file date_admin data_sensor nNA
<chr> <chr> <list> <int>
1 file 1 2018-10-07 <tibble [10 x 2]> 0
2 file 2 2019-07-29 <tibble [10 x 2]> 10
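That column makes a quick sanity check easy; for instance, a hypothetical usage to list the files where observations were dropped:
# Files whose series was truncated (at least one row turned into NA)
data_sensor %>% filter(nNA > 0)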
So it's time to test it on somewhat bigger data. Here I used your date_admin vector and drew 106 tibbles, each containing 100,000 observations!
date_admin = c(
"2018-10-07", "2018-12-29", "2018-12-13", "2019-08-09", "2019-10-10",
"2019-04-26", "2018-11-21", "2018-08-23", "2019-07-08", "2019-11-19",
"2019-11-07", "2018-09-05", "2018-09-03", "2018-09-24", "2018-10-11",
"2018-09-25", "2019-03-29", "2018-08-20", "2018-09-17", "2019-03-30",
"2018-11-07", "2019-01-01", "2018-08-31", "2019-03-27", "2019-11-10",
"2019-04-04", "2019-10-18", "2018-09-06", "2018-09-23", "2018-09-22",
"2019-07-22", "2018-09-04", "2019-05-17", "2018-11-05", "2018-12-09",
"2018-09-03", "2019-05-21", "2019-02-22", "2018-08-30", "2019-06-04",
"2018-09-13", "2018-10-14", "2019-11-08", "2018-08-30", "2019-04-12",
"2018-09-24", "2018-08-22", "2018-08-30", "2018-09-07", "2018-11-11",
"2018-11-01", "2018-10-01", "2018-10-22", "2018-12-03", "2019-06-06",
"2018-09-09", "2018-09-10", "2018-09-24", "2018-10-11", "2018-11-30",
"2018-09-20", "2019-11-20", "2018-10-11", "2018-10-09", "2018-09-27",
"2019-11-11", "2018-10-04", "2018-09-14", "2019-04-27", "2018-09-04",
"2018-09-11", "2018-08-14", "2018-09-01", "2018-10-01", "2018-09-25",
"2018-09-28", "2018-09-29", "2018-10-11", "2019-03-26", "2018-10-26",
"2018-11-21", "2018-12-02", "2018-09-08", "2019-01-08", "2018-11-07",
"2019-02-05", "2019-01-21", "2018-09-11", "2018-12-17", "2019-01-15",
"2018-08-28", "2019-01-08", "2019-05-14", "2019-01-21", "2018-11-12",
"2018-10-26", "2019-12-26", "2020-01-03", "2020-01-06", "2020-02-26",
"2020-02-14", "2020-01-27", "2020-01-21", "2020-03-16", "2020-02-26",
"2019-12-31")
nDateSensor = 100000
set.seed(1234)
data_sensor =
  tibble(
    file = paste("file", 1:length(date_admin)),
    date_admin = date_admin
  ) %>%
  mutate(data_sensor = map(file, ~fDateSensor(nDateSensor)))
output
data_sensor
# A tibble: 106 x 3
file date_admin data_sensor
<chr> <chr> <list>
1 file 1 2018-10-07 <tibble [100,000 x 2]>
2 file 2 2018-12-29 <tibble [100,000 x 2]>
3 file 3 2018-12-13 <tibble [100,000 x 2]>
4 file 4 2019-08-09 <tibble [100,000 x 2]>
5 file 5 2019-10-10 <tibble [100,000 x 2]>
6 file 6 2019-04-26 <tibble [100,000 x 2]>
7 file 7 2018-11-21 <tibble [100,000 x 2]>
8 file 8 2018-08-23 <tibble [100,000 x 2]>
9 file 9 2019-07-08 <tibble [100,000 x 2]>
10 file 10 2019-11-19 <tibble [100,000 x 2]>
# ... with 96 more rows
Time for the mutation. We will measure how long it takes at the same time.
start_time = Sys.time()
data_sensor = data_sensor %>%
  group_by(file) %>%
  group_modify(~f(.x))
Sys.time() - start_time
For me it took 2.3 seconds. I don't know what time you expected, but it seems a pretty good result.
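If you prefer the timing without the bookkeeping variable, base R's system.time() works too; a minimal sketch of the same measurement:
# 'elapsed' is the wall-clock time; this only times the pipeline,
# assign the result if you want to keep it
system.time(
  data_sensor %>% group_by(file) %>% group_modify(~f(.x))
)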
Let's see what our data_sensor looks like.
# A tibble: 106 x 4
# Groups: file [106]
file date_admin data_sensor nNA
<chr> <chr> <list> <int>
1 file 1 2018-10-07 <tibble [100,000 x 2]> 0
2 file 10 2019-11-19 <tibble [100,000 x 2]> 19001
3 file 100 2020-02-26 <tibble [100,000 x 2]> 95897
4 file 101 2020-02-14 <tibble [100,000 x 2]> 7769
5 file 102 2020-01-27 <tibble [100,000 x 2]> 99497
6 file 103 2020-01-21 <tibble [100,000 x 2]> 0
7 file 104 2020-03-16 <tibble [100,000 x 2]> 50969
8 file 105 2020-02-26 <tibble [100,000 x 2]> 0
9 file 106 2019-12-31 <tibble [100,000 x 2]> 13673
10 file 11 2019-11-07 <tibble [100,000 x 2]> 16697
# ... with 96 more rows
As you can see, some of the data has been changed to NA, so everything worked fine. All you have to do is read the xls file names into data_sensor, and then use group_by(file) and group_modify to load and trim the data in the variable data_sensor. Good luck!
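A minimal sketch of that last step (untested, and assuming readxl is available and that date_admin[i] corresponds to files[i] in the same order):
library(readxl) # read_xls()

files <- dir("C:/User*inals", pattern = "\\.xls$", full.names = TRUE)

data_sensor <- tibble(file = files, date_admin = date_admin) %>%
  mutate(data_sensor = map(file, read_xls)) %>% # read each xls into a nested tibble
  group_by(file) %>%
  group_modify(~f(.x))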