I have a datasets collected by alternative auto-time setting. This is the example format.
file = data.table(ID = c(1:19), time = as.hms("6:01:00 ", "6:03:00 ",
"6:05:00 ", "6:08:00 ",
"6:10:00 ", "6:12:00 ",
"6:15:00 ", "6:17:00 ",
"6:19:00 ", "6:22:00 ",
"6:24:00 ", "6:26:00 ",
"6:29:00 "),
Turn = c("A", "A", "B", "B", "C", "C", "A", "A", "A", "B",
"B", "C", "C"),
Value = c("16.00", "18.00", "17.00", "19.00", "21.00",
"26.00", "25.60", "27.34", "29.09", "30.83",
"32.57", "34.31", "36.06"))
From this dataset, I separated it into three dataframes for each turn by using the following code:
Data_A <- file[file$Turn %in% c("A"), ]
Data_B <- file[file$Turn %in% c("B"), ]
Data_C <- file[file$Turn %in% c("C"), ]
Data_A
ID Time Turn Value
1 6:01:00 A 16.00
2 6:03:00 A 18.00
7 6:15:00 A 25.60
8 6:17:00 A 27.34
9 6:19:00 A 29.09
Data_C
ID Time Turn Value
5 6:10:00 C 21.00
6 6:12:00 C 26.00
12 6:26:00 C 34.31
13 6:29:00 C 36.06
I want to combine Data_A and Data_C by the nearest time as follow:
ID Time Turn Value Time C Turn C Value C
1 6:01:00 A 16.00 6:10:00 C 21.00
2 6:03:00 A 18.00 6:12:00 C 26.00
7 6:15:00 A 25.60 6:26:00 C 34.31
8 6:17:00 A 27.34 6:29:00 C 36.06
9 6:19:00 A 29.09
As you see, there is an extra A in Data_A. I will exclude it. There are so many such irregular data in the time series dataset. Can anyone help me how to code for this result? Thanks!
CodePudding user response:
In order to combine two frames like this, you need to ensure they both have the same number of rows. This means that (for instance) Data_C
will need to have a row of NA
values appended. This is fairly straight-forward:
maxrows <- max(nrow(Data_A), nrow(Data_B), nrow(Data_C))
Data_C <- do.call(rbind, c(list(Data_C),
replicate(maxrows - nrow(Data_C),
Data_C[1,][NA,],
simplify = FALSE)))
names(Data_C) <- paste0(names(Data_C), "_C")
cbind(Data_A, Data_C)
# ID time Turn Value ID_C time_C Turn_C Value_C
# <int> <hms> <char> <char> <int> <hms> <char> <char>
# 1: 1 06:01:00 A 16.00 5 06:01:00 C 21.00
# 2: 2 06:01:00 A 18.00 6 06:01:00 C 26.00
# 3: 7 06:01:00 A 25.60 12 06:01:00 C 34.31
# 4: 8 06:01:00 A 27.34 13 06:01:00 C 36.06
# 5: 9 06:01:00 A 29.09 NA NA <NA> <NA>
The use of Data_C[1,][NA,]
is meant to get a single row where each value is the appropriate form of NA
(there are at least six distinct classes of NA
).
Side thoughts:
It's generally better to keep the data all in one frame instead of multiple frames, that way grouping operations are defined in one place and can be done easily using
dplyr::group_by
,base::ave
(and other grouping functions), anddata.table
'sby=
argument.Even if you do split a frame into multiple frames, it is often better to keep them as a list of frames, since what you often do to one is easily done to all using
lapply
. One could generate this easily withspl <- split(file, file$Turn)
. From there, referencesdl$C
instead of yourData_C
.This might be done more efficiently on
file
as a pivot.
library(dplyr)
library(tidyr)
file %>%
group_by(Turn) %>%
mutate(rn = row_number()) %>%
ungroup() %>%
pivot_wider("rn", names_from = "Turn", values_from = c("ID", "time", "Value")) %>%
select(-rn)
# # A tibble: 5 × 9
# ID_A ID_B ID_C time_A time_B time_C Value_A Value_B Value_C
# <int> <int> <int> <time> <time> <time> <chr> <chr> <chr>
# 1 1 3 5 06:01 06:01 06:01 16.00 17.00 21.00
# 2 2 4 6 06:01 06:01 06:01 18.00 19.00 26.00
# 3 7 10 12 06:01 06:01 06:01 25.60 30.83 34.31
# 4 8 11 13 06:01 06:01 06:01 27.34 32.57 36.06
# 5 9 NA NA 06:01 NA NA 29.09 NA NA
```
or with `data.table`:
```r
file[, rn := seq_len(.N), by = Turn]
dcast(file, rn ~ Turn, value.var = c("ID", "time", "Value"))[, rn := NULL][]
# ID_A ID_B ID_C time_A time_B time_C Value_A Value_B Value_C
# <int> <int> <int> <hms> <hms> <hms> <char> <char> <char>
# 1: 1 3 5 06:01:00 06:01:00 06:01:00 16.00 17.00 21.00
# 2: 2 4 6 06:01:00 06:01:00 06:01:00 18.00 19.00 26.00
# 3: 7 10 12 06:01:00 06:01:00 06:01:00 25.60 30.83 34.31
# 4: 8 11 13 06:01:00 06:01:00 06:01:00 27.34 32.57 36.06
# 5: 9 NA NA 06:01:00 NA NA 29.09 <NA> <NA>
```
Ordering of columns should be relatively straight-forward.
CodePudding user response:
You could use data.table's roll='nearest'
join option.
In order to do so, time
needs to be converted to numeric:
library(data.table)
library(hms)
file = data.table(ID = c(1:19), time = lapply(c("6:01:00", "6:03:00","6:05:00", "6:08:00 ","6:10:00 ", "6:12:00 ", "6:15:00 ", "6:17:00 ", "6:19:00 ", "6:22:00 ", "6:24:00 ", "6:26:00 ", "6:29:00 "),as.hms),
Turn = c("A", "A", "B", "B", "C", "C", "A", "A", "A", "B",
"B", "C", "C"),
Value = c("16.00", "18.00", "17.00", "19.00", "21.00",
"26.00", "25.60", "27.34", "29.09", "30.83",
"32.57", "34.31", "36.06"))
# numeric join field
file[,itime:=as.numeric(time)]
Data_A <- file[Turn %in% c("A"), ]
Data_B <- file[Turn %in% c("B"), ]
Data_C <- file[Turn %in% c("C"), ]
Data_C[Data_A,.(ID = i.ID,
time = i.time,
Value = i.Value,
Turn = i.Turn,
time_C = x.time,
Value_C = x.Value,
Turn_C = x.Turn),on=.(itime),roll='nearest']
#> ID time Value Turn time_C Value_C Turn_C
#> <int> <list> <char> <char> <list> <char> <char>
#> 1: 1 06:01:00 16.00 A 06:10:00 21.00 C
#> 2: 2 06:03:00 18.00 A 06:10:00 21.00 C
#> 3: 7 06:15:00 25.60 A 06:12:00 26.00 C
#> 4: 8 06:17:00 27.34 A 06:12:00 26.00 C
#> 5: 9 06:19:00 29.09 A 06:12:00 26.00 C
#> 6: 14 06:01:00 16.00 A 06:10:00 21.00 C
#> 7: 15 06:03:00 18.00 A 06:10:00 21.00 C
CodePudding user response:
I would try something in that logic:
First, arrange both dataset by descending order of time, using the arrange() function from the tidyverse packages.
Now that the two datas are arranged the same way,
Join them together according to dataset C using the left_join() function (also from tidyverse).
library(tidyverse)
A = arrange(Data_A, Time)
C = arrange(Data_C, Time)
left_join(A,C)
I tend to confuse left, right or inner join. Please make sure you get your desired result and change to either inner or right (there's also full/anti join, you'll figure out what each does if you play with it. It's basically a venn diagram of datasets).
To use the tidyverse packages use install.packages("tidyverse") if you don't have them installed.