How to collate two time series dataframes that do not have matching times and contain different colu-CodePudding

I have two dataframes

df1 <- setNames(data.frame(c(as.POSIXct("2022-07-29 00:00:00","2022-07-29 00:05:00","2022-07-29 00:10:00","2022-07-29 00:15:00","2022-07-29 00:20:00")), c(1,2,3,4,5)), c("timeStamp", "value"))
df2 <- setNames(data.frame(c(as.POSIXct("2022-07-29 00:00:05","2022-07-29 00:05:05","2022-07-29 00:20:05")), c("a","b","c")), c("timeStamp", "text"))

I want to merge them into a single data frame by timeStamp, where the times are matched in chronological order, and the mismatching columns are simply filled in with NA. So it would read

timestamp              value     text
"2022-07-29 00:00:00"  1         NA
"2022-07-29 00:00:05"  NA        a
...

I can't for my life figure out why simply merge(x,y,by=) doesn't work for this. neither does left_join. It seems so simple and I can't seem to figure it out.

Also, a bonus would be to make it work this way as well, in which each time there is a change in text, all the rows below it remain the last text until the text changes again prompted by df2:

timestamp              value     text
"2022-07-29 00:00:00"  1         NA
"2022-07-29 00:00:05"  NA        a
"2022-07-29 00:05:00"  2         a
"2022-07-29 00:05:05"  NA        b
"2022-07-29 00:10:00"  3         b
"2022-07-29 00:15:00"  4         b
"2022-07-29 00:20:00"  5         b
"2022-07-29 00:20:05"  NA        c

Thanks

CodePudding user response：

You need to do a full join.

tidyverse solution:

df1 %>% 
  full_join(df2, c("timeStamp" = "timeStamp")) %>% 
  arrange(timeStamp) %>% 
  tidyr::fill(text)

            timeStamp value text
1 2022-07-29 00:00:00     1 <NA>
2 2022-07-29 00:00:05    NA    a
3 2022-07-29 00:05:00     2    a
4 2022-07-29 00:05:05    NA    b
5 2022-07-29 00:10:00     3    b
6 2022-07-29 00:15:00     4    b
7 2022-07-29 00:20:00     5    b
8 2022-07-29 00:20:05    NA    c

Base R merge:

merge(df1, df2, by = c("timeStamp" = "timeStamp"), all = T)

CodePudding user response：

Using the corrected definitions of df1 and df2 in the Note at the end perform a merge using all = TRUE and then use na.locf0 to fill in the NA's.

library(zoo)

m <- merge(df1, df2, all = TRUE) |> transform(text = na.locf0(text)); m
##             timeStamp value text
## 1 2022-07-29 00:00:00     1 <NA>
## 2 2022-07-29 00:00:05    NA    a
## 3 2022-07-29 00:05:00     2    a
## 4 2022-07-29 00:05:05    NA    b
## 5 2022-07-29 00:10:00     3    b
## 6 2022-07-29 00:15:00     4    b
## 7 2022-07-29 00:20:00     5    b
## 8 2022-07-29 00:20:05    NA    c

If the different values of text actually refer to different series that should be put in separate columns then we can create the following zoo series. (Use fortify.zoo(z) if you need to convert that to a data frame.)

z <- read.zoo(m, split = "text"); z
##                      a  b  c
## 2022-07-29 00:00:05 NA NA NA
## 2022-07-29 00:05:00  2 NA NA
## 2022-07-29 00:05:05 NA NA NA
## 2022-07-29 00:10:00 NA  3 NA
## 2022-07-29 00:15:00 NA  4 NA
## 2022-07-29 00:20:00 NA  5 NA
## 2022-07-29 00:20:05 NA NA NA

Note

df1 <- data.frame(timeStamp = as.POSIXct(c("2022-07-29 00:00:00",
"2022-07-29 00:05:00", "2022-07-29 00:10:00","2022-07-29 00:15:00",
"2022-07-29 00:20:00")), value = c(1,2,3,4,5))

df2 <- data.frame(timeStamp = as.POSIXct(c("2022-07-29 00:00:05",
"2022-07-29 00:05:05", "2022-07-29 00:20:05")), 
text = c("a","b","c"))