How to extract timestamp difference of the first 2 equal values from 2 data frame-CodePudding

I am working with time series, i have 2 different time series that have 2 columns and different row number.

df_1=read.table("data_1")
df_2=read.table("data_2")

I would like to compare the values of df_1$V2 (second column) with the values in df_2$V2, if they are equal calculate the time difference between them (df_2$V1[i]-df_2$V1[j]) here is my code:

 vect=c()
  horo=c()
  j=1
  for (i in 2: nrow(df_1)){
    for(j in 1:nrow(df_2)) {
      if(df_1$V2[i]==df_2$V2[j]){
               calc=abs(df_2$V1[j] - df_1$V1[i])
               vect=append(vect, calc)
      }      
     }    
    }

The problem is:

it could exist many element in df_2$V2[j] that are equal to df_1$V2[i] and i only want the first value.
as i know that in my data if (for example) df_1$V2[1]= df_2$V2[8] so for the next iteration no need to compare the df_1$V1[2] with the first 8 values of df_2$V2 and i can start comparing from df_2$V2[9]
it take too much time... because of the for loop, so is there another way to do it? Thank you for your help!

data example:

df_1=

15.942627 2633
15.942630 2664
15.942831 2699
15.943421 3068
15.943422 4256
15.943423 5444
15.943425 6632
15.943426 7820
15.945489 9008
15.945490 10196
15.945995 11384
15.960359 12572
15.960360 13760
15.960413 14948
15.960414 16136
15.961537 17202
15.962138 18390
15.962139 18624
16.042805 18659
16.043349 18851
....

df_2=

15.942244 2376
15.942332 2376
15.942332 2376
15.959306 2633
15.960350 2633
15.961223 3068
15.967225 6632
15.978364 10196
15.982280 12572
15.994296 16136
15.994379 18624
16.042336 18624
16.060262 18659
16.065397 21250
16.069239 24814
16.073407 28378
16.077236 31942

CodePudding user response：

You've mentioned that your for-loop is slow; it's generally advisable to avoid writing your own for-loops in R, and letting built-in vectorisation handle things efficiently.

Here's a non-for-loop-dependent solution using the popular dplyr package from the tidyverse.

Read in data

First, let's read in your data for the sake of reproducibility. Note that I've added names to your data, because unnamed data is confusing and hard to work with.

library(vroom) # useful package for flexible data reading

df_1 <- vroom(
"timestamp value
15.942627 2633
15.942630 2664
15.942831 2699
15.943421 3068
15.943422 4256
15.943423 5444
15.943425 6632
15.943426 7820
15.945489 9008
15.945490 10196
15.945995 11384
15.960359 12572
15.960360 13760
15.960413 14948
15.960414 16136
15.961537 17202
15.962138 18390
15.962139 18624
16.042805 18659
16.043349 18851")
#> Rows: 20 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: " "
#> dbl (2): timestamp, value
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df_2 <- vroom(
"timestamp value
15.942244 2376
15.942332 2376
15.942332 2376
15.959306 2633
15.960350 2633
15.961223 3068
15.967225 6632
15.978364 10196
15.982280 12572
15.994296 16136
15.994379 18624
16.042336 18624
16.060262 18659
16.065397 21250
16.069239 24814
16.073407 28378
16.077236 31942")
#> Rows: 17 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: " "
#> dbl (2): timestamp, value
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Comparing time differences for matching values

Let's go through the solution step-by-step:

Add `id` for each row of df_1

We'll need this later to remove unwanted values.

require(dplyr)
#> Loading required package: dplyr

df_1 <- mutate(df_1, id = paste0("id_", row_number()))

df_1 <- relocate(df_1, id)

head(df_1)
#> # A tibble: 6 × 3
#>   id    timestamp value
#>   <chr>     <dbl> <dbl>
#> 1 id_1       15.9  2633
#> 2 id_2       15.9  2664
#> 3 id_3       15.9  2699
#> 4 id_4       15.9  3068
#> 5 id_5       15.9  4256
#> 6 id_6       15.9  5444

Join rows from df_2 on matching values

joined <- left_join(df_1, df_2, by = "value", suffix = c(".1", ".2"))
head(joined)
#> # A tibble: 6 × 4
#>   id    timestamp.1 value timestamp.2
#>   <chr>       <dbl> <dbl>       <dbl>
#> 1 id_1         15.9  2633        16.0
#> 2 id_1         15.9  2633        16.0
#> 3 id_2         15.9  2664        NA  
#> 4 id_3         15.9  2699        NA  
#> 5 id_4         15.9  3068        16.0
#> 6 id_5         15.9  4256        NA

Get the first returned value for each value in df_1

We can do this by grouping by our id column, then just getting the first() row from each group.

joined <- group_by(joined, id) # group by row identifiers
summary <- summarise(joined, across(everything(), first))
head(summary)
#> # A tibble: 6 × 4
#>   id    timestamp.1 value timestamp.2
#>   <chr>       <dbl> <dbl>       <dbl>
#> 1 id_1         15.9  2633        16.0
#> 2 id_10        15.9 10196        16.0
#> 3 id_11        15.9 11384        NA  
#> 4 id_12        16.0 12572        16.0
#> 5 id_13        16.0 13760        NA  
#> 6 id_14        16.0 14948        NA

Get time difference

A simple case of using mutate() to subtract timestamp.1 from timestamp.2:

times <- mutate(summary, time_diff = timestamp.2 - timestamp.1) |>
  relocate(value, .after = id) # this is just for presentation

## You may want to remove rows with no time diff?
filter(times, !is.na(time_diff))
#> # A tibble: 8 × 5
#>   id    value timestamp.1 timestamp.2 time_diff
#>   <chr> <dbl>       <dbl>       <dbl>     <dbl>
#> 1 id_1   2633        15.9        16.0    0.0167
#> 2 id_10 10196        15.9        16.0    0.0329
#> 3 id_12 12572        16.0        16.0    0.0219
#> 4 id_15 16136        16.0        16.0    0.0339
#> 5 id_18 18624        16.0        16.0    0.0322
#> 6 id_19 18659        16.0        16.1    0.0175
#> 7 id_4   3068        15.9        16.0    0.0178
#> 8 id_7   6632        15.9        16.0    0.0238

^{Created on 2022-10-25 with reprex v2.0.2}