I am working with time series, i have 2 different time series that have 2 columns and different row number.
df_1=read.table("data_1")
df_2=read.table("data_2")
I would like to compare the values of df_1$V2
(second column) with the values in df_2$V2
, if they are equal calculate the time difference between them (df_2$V1[i]-df_2$V1[j])
here is my code:
vect=c()
horo=c()
j=1
for (i in 2: nrow(df_1)){
for(j in 1:nrow(df_2)) {
if(df_1$V2[i]==df_2$V2[j]){
calc=abs(df_2$V1[j] - df_1$V1[i])
vect=append(vect, calc)
}
}
}
The problem is:
- it could exist many element in
df_2$V2[j]
that are equal todf_1$V2[i]
and i only want the first value. - as i know that in my data if (for example)
df_1$V2[1]= df_2$V2[8]
so for the next iteration no need to compare thedf_1$V1[2]
with the first 8 values ofdf_2$V2
and i can start comparing fromdf_2$V2[9]
- it take too much time... because of the for loop, so is there another way to do it? Thank you for your help!
data example:
df_1=
15.942627 2633
15.942630 2664
15.942831 2699
15.943421 3068
15.943422 4256
15.943423 5444
15.943425 6632
15.943426 7820
15.945489 9008
15.945490 10196
15.945995 11384
15.960359 12572
15.960360 13760
15.960413 14948
15.960414 16136
15.961537 17202
15.962138 18390
15.962139 18624
16.042805 18659
16.043349 18851
....
df_2=
15.942244 2376
15.942332 2376
15.942332 2376
15.959306 2633
15.960350 2633
15.961223 3068
15.967225 6632
15.978364 10196
15.982280 12572
15.994296 16136
15.994379 18624
16.042336 18624
16.060262 18659
16.065397 21250
16.069239 24814
16.073407 28378
16.077236 31942
CodePudding user response:
You've mentioned that your for-loop is slow; it's generally advisable to avoid writing your own for-loops in R, and letting built-in vectorisation handle things efficiently.
Here's a non-for-loop-dependent solution using the popular dplyr
package from the tidyverse.
Read in data
First, let's read in your data for the sake of reproducibility. Note that I've added names to your data, because unnamed data is confusing and hard to work with.
library(vroom) # useful package for flexible data reading
df_1 <- vroom(
"timestamp value
15.942627 2633
15.942630 2664
15.942831 2699
15.943421 3068
15.943422 4256
15.943423 5444
15.943425 6632
15.943426 7820
15.945489 9008
15.945490 10196
15.945995 11384
15.960359 12572
15.960360 13760
15.960413 14948
15.960414 16136
15.961537 17202
15.962138 18390
15.962139 18624
16.042805 18659
16.043349 18851")
#> Rows: 20 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: " "
#> dbl (2): timestamp, value
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_2 <- vroom(
"timestamp value
15.942244 2376
15.942332 2376
15.942332 2376
15.959306 2633
15.960350 2633
15.961223 3068
15.967225 6632
15.978364 10196
15.982280 12572
15.994296 16136
15.994379 18624
16.042336 18624
16.060262 18659
16.065397 21250
16.069239 24814
16.073407 28378
16.077236 31942")
#> Rows: 17 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: " "
#> dbl (2): timestamp, value
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Comparing time differences for matching values
Let's go through the solution step-by-step:
Add id
for each row of df_1
We'll need this later to remove unwanted values.
require(dplyr)
#> Loading required package: dplyr
df_1 <- mutate(df_1, id = paste0("id_", row_number()))
df_1 <- relocate(df_1, id)
head(df_1)
#> # A tibble: 6 × 3
#> id timestamp value
#> <chr> <dbl> <dbl>
#> 1 id_1 15.9 2633
#> 2 id_2 15.9 2664
#> 3 id_3 15.9 2699
#> 4 id_4 15.9 3068
#> 5 id_5 15.9 4256
#> 6 id_6 15.9 5444
Join rows from df_2 on matching values
joined <- left_join(df_1, df_2, by = "value", suffix = c(".1", ".2"))
head(joined)
#> # A tibble: 6 × 4
#> id timestamp.1 value timestamp.2
#> <chr> <dbl> <dbl> <dbl>
#> 1 id_1 15.9 2633 16.0
#> 2 id_1 15.9 2633 16.0
#> 3 id_2 15.9 2664 NA
#> 4 id_3 15.9 2699 NA
#> 5 id_4 15.9 3068 16.0
#> 6 id_5 15.9 4256 NA
Get the first returned value for each value in df_1
We can do this by grouping by our id
column, then just getting the first()
row from each group.
joined <- group_by(joined, id) # group by row identifiers
summary <- summarise(joined, across(everything(), first))
head(summary)
#> # A tibble: 6 × 4
#> id timestamp.1 value timestamp.2
#> <chr> <dbl> <dbl> <dbl>
#> 1 id_1 15.9 2633 16.0
#> 2 id_10 15.9 10196 16.0
#> 3 id_11 15.9 11384 NA
#> 4 id_12 16.0 12572 16.0
#> 5 id_13 16.0 13760 NA
#> 6 id_14 16.0 14948 NA
Get time difference
A simple case of using mutate()
to subtract timestamp.1
from timestamp.2
:
times <- mutate(summary, time_diff = timestamp.2 - timestamp.1) |>
relocate(value, .after = id) # this is just for presentation
## You may want to remove rows with no time diff?
filter(times, !is.na(time_diff))
#> # A tibble: 8 × 5
#> id value timestamp.1 timestamp.2 time_diff
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 id_1 2633 15.9 16.0 0.0167
#> 2 id_10 10196 15.9 16.0 0.0329
#> 3 id_12 12572 16.0 16.0 0.0219
#> 4 id_15 16136 16.0 16.0 0.0339
#> 5 id_18 18624 16.0 16.0 0.0322
#> 6 id_19 18659 16.0 16.1 0.0175
#> 7 id_4 3068 15.9 16.0 0.0178
#> 8 id_7 6632 15.9 16.0 0.0238
Created on 2022-10-25 with reprex v2.0.2