Join the two dataframe by conditions-CodePudding

I have a datasets collected by alternative auto-time setting. This is the example format.

file = data.table(ID = c(1:19), time = as.hms("6:01:00 ", "6:03:00 ", 
                                                  "6:05:00 ", "6:08:00 ",
                                                  "6:10:00 ", "6:12:00 ",
                                                  "6:15:00 ", "6:17:00 ",
                                                  "6:19:00 ", "6:22:00 ",
                                                  "6:24:00 ", "6:26:00 ",
                                                  "6:29:00 "), 
                  Turn = c("A", "A", "B", "B", "C", "C", "A", "A", "A", "B", 
                           "B", "C", "C"),
                  Value = c("16.00", "18.00", "17.00", "19.00", "21.00",
                            "26.00", "25.60", "27.34", "29.09", "30.83",
                            "32.57", "34.31", "36.06"))

From this dataset, I separated it into three dataframes for each turn by using the following code:

Data_A <- file[file$Turn %in% c("A"), ]
Data_B <- file[file$Turn %in% c("B"), ]
Data_C <- file[file$Turn %in% c("C"), ]


Data_A
ID  Time    Turn    Value
1   6:01:00 A   16.00
2   6:03:00 A   18.00
7   6:15:00 A   25.60
8   6:17:00 A   27.34
9   6:19:00 A   29.09

Data_C
ID  Time    Turn    Value
5   6:10:00 C   21.00
6   6:12:00 C   26.00
12  6:26:00 C   34.31
13  6:29:00 C   36.06

I want to combine Data_A and Data_C by the nearest time as follow:

ID  Time    Turn    Value Time C Turn C Value C
1   6:01:00 A   16.00  6:10:00  C   21.00
2   6:03:00 A   18.00  6:12:00  C   26.00
7   6:15:00 A   25.60  6:26:00  C   34.31
8   6:17:00 A   27.34  6:29:00  C   36.06
9   6:19:00 A   29.09

As you see, there is an extra A in Data_A. I will exclude it. There are so many such irregular data in the time series dataset. Can anyone help me how to code for this result? Thanks!

CodePudding user response：

In order to combine two frames like this, you need to ensure they both have the same number of rows. This means that (for instance) Data_C will need to have a row of NA values appended. This is fairly straight-forward:

maxrows <- max(nrow(Data_A), nrow(Data_B), nrow(Data_C))
Data_C <- do.call(rbind, c(list(Data_C),
                           replicate(maxrows - nrow(Data_C),
                                     Data_C[1,][NA,],
                                     simplify = FALSE)))
names(Data_C) <- paste0(names(Data_C), "_C")
cbind(Data_A, Data_C)
#       ID     time   Turn  Value  ID_C   time_C Turn_C Value_C
#    <int>    <hms> <char> <char> <int>    <hms> <char>  <char>
# 1:     1 06:01:00      A  16.00     5 06:01:00      C   21.00
# 2:     2 06:01:00      A  18.00     6 06:01:00      C   26.00
# 3:     7 06:01:00      A  25.60    12 06:01:00      C   34.31
# 4:     8 06:01:00      A  27.34    13 06:01:00      C   36.06
# 5:     9 06:01:00      A  29.09    NA       NA   <NA>    <NA>

The use of Data_C[1,][NA,] is meant to get a single row where each value is the appropriate form of NA (there are at least six distinct classes of NA).

Side thoughts:

It's generally better to keep the data all in one frame instead of multiple frames, that way grouping operations are defined in one place and can be done easily using dplyr::group_by, base::ave (and other grouping functions), and data.table's by= argument.
Even if you do split a frame into multiple frames, it is often better to keep them as a list of frames, since what you often do to one is easily done to all using lapply. One could generate this easily with spl <- split(file, file$Turn). From there, reference sdl$C instead of your Data_C.
This might be done more efficiently on file as a pivot.


    library(dplyr)
    library(tidyr)
    file %>%
      group_by(Turn) %>%
      mutate(rn = row_number()) %>%
      ungroup() %>%
      pivot_wider("rn", names_from = "Turn", values_from = c("ID", "time", "Value")) %>%
      select(-rn)
    # # A tibble: 5 × 9
    #    ID_A  ID_B  ID_C time_A time_B time_C Value_A Value_B Value_C
    #   <int> <int> <int> <time> <time> <time> <chr>   <chr>   <chr>  
    # 1     1     3     5 06:01  06:01  06:01  16.00   17.00   21.00  
    # 2     2     4     6 06:01  06:01  06:01  18.00   19.00   26.00  
    # 3     7    10    12 06:01  06:01  06:01  25.60   30.83   34.31  
    # 4     8    11    13 06:01  06:01  06:01  27.34   32.57   36.06  
    # 5     9    NA    NA 06:01     NA     NA  29.09   NA      NA     
    ```

    or with `data.table`:

    ```r
    file[, rn := seq_len(.N), by = Turn]
    dcast(file, rn ~ Turn, value.var = c("ID", "time", "Value"))[, rn := NULL][]
    #     ID_A  ID_B  ID_C   time_A   time_B   time_C Value_A Value_B Value_C
    #    <int> <int> <int>    <hms>    <hms>    <hms>  <char>  <char>  <char>
    # 1:     1     3     5 06:01:00 06:01:00 06:01:00   16.00   17.00   21.00
    # 2:     2     4     6 06:01:00 06:01:00 06:01:00   18.00   19.00   26.00
    # 3:     7    10    12 06:01:00 06:01:00 06:01:00   25.60   30.83   34.31
    # 4:     8    11    13 06:01:00 06:01:00 06:01:00   27.34   32.57   36.06
    # 5:     9    NA    NA 06:01:00       NA       NA   29.09    <NA>    <NA>
    ```

    Ordering of columns should be relatively straight-forward.

CodePudding user response：

You could use data.table's roll='nearest' join option.
In order to do so, time needs to be converted to numeric:

library(data.table)
library(hms)

file = data.table(ID = c(1:19), time = lapply(c("6:01:00", "6:03:00","6:05:00", "6:08:00 ","6:10:00 ", "6:12:00 ", "6:15:00 ", "6:17:00 ", "6:19:00 ", "6:22:00 ",       "6:24:00 ", "6:26:00 ", "6:29:00 "),as.hms), 
                  Turn = c("A", "A", "B", "B", "C", "C", "A", "A", "A", "B", 
                           "B", "C", "C"),
                  Value = c("16.00", "18.00", "17.00", "19.00", "21.00",
                            "26.00", "25.60", "27.34", "29.09", "30.83",
                            "32.57", "34.31", "36.06"))


# numeric join field
file[,itime:=as.numeric(time)]

Data_A <- file[Turn %in% c("A"), ]
Data_B <- file[Turn %in% c("B"), ]
Data_C <- file[Turn %in% c("C"), ]

Data_C[Data_A,.(ID = i.ID,
                time = i.time,
                Value = i.Value,
                Turn = i.Turn,
                time_C = x.time,
                Value_C = x.Value,
                Turn_C = x.Turn),on=.(itime),roll='nearest']
#>       ID     time  Value   Turn   time_C Value_C Turn_C
#>    <int>   <list> <char> <char>   <list>  <char> <char>
#> 1:     1 06:01:00  16.00      A 06:10:00   21.00      C
#> 2:     2 06:03:00  18.00      A 06:10:00   21.00      C
#> 3:     7 06:15:00  25.60      A 06:12:00   26.00      C
#> 4:     8 06:17:00  27.34      A 06:12:00   26.00      C
#> 5:     9 06:19:00  29.09      A 06:12:00   26.00      C
#> 6:    14 06:01:00  16.00      A 06:10:00   21.00      C
#> 7:    15 06:03:00  18.00      A 06:10:00   21.00      C

CodePudding user response：

I would try something in that logic:

First, arrange both dataset by descending order of time, using the arrange() function from the tidyverse packages.

Now that the two datas are arranged the same way,

Join them together according to dataset C using the left_join() function (also from tidyverse).

library(tidyverse)
A = arrange(Data_A, Time)
C = arrange(Data_C, Time)

left_join(A,C)

I tend to confuse left, right or inner join. Please make sure you get your desired result and change to either inner or right (there's also full/anti join, you'll figure out what each does if you play with it. It's basically a venn diagram of datasets).

To use the tidyverse packages use install.packages("tidyverse") if you don't have them installed.