Home > Blockchain >  Interpolate and insert missing rows into dataframe R
Interpolate and insert missing rows into dataframe R

Time:05-03

I am working with a large time series of oceanographic data which needs a lot of manipulation. I have several days of data missing and would like to interpolate them. Specifically date/depth/temperature. Here is an example of my df:

> tibble(df)
# A tibble: 351,685 x 9
   date       time     depthR    SV  temp salinity conduct density calcSV
   <date>     <times>  <dbl> <dbl> <dbl>    <dbl>   <dbl>   <dbl>  <dbl>
 1 2021-11-17 07:50:18   0.5 1524.  19.7     37.8    51.0    27    1524.
 2 2021-11-17 07:50:22   0.5 1524.  19.9     37.6    50.9    26.8  1524.
 3 2021-11-17 07:50:23   1.1 1524.  19.9     37.6    50.9    26.8  1524.
 4 2021-11-17 07:50:24   1.5 1524.  19.9     37.6    50.9    26.8  1524.
 5 2021-11-17 07:50:25   2   1524.  19.9     37.6    50.9    26.8  1524.

Each date contains over 1000 lines of data and so my idea was to find the max depth of each day to therefore interpolate reasonable max depth values for the missing days between. So far, I have found the max depth per date:

group <- df %>% group_by(date) %>% summarise(max =max(depthR, na.rm=TRUE))
> tibble(group)
# A tibble: 40 x 2
   date         max
   <date>     <dbl>
 1 2021-11-17 685. 
 2 2021-11-18 695. 
 3 2021-11-19 136. 
 4 2021-11-20 138. 
 5 2021-11-21 142. 
 6 2021-11-22  26  
 7 2021-11-23 136. 
 8 2021-11-24 297. 
 9 2021-11-25 613. 
10 2021-11-26  81.1
# ... with 30 more rows

And then I managed to interpolate the missing dates by:

> group <- seq(min(group$date), max(group$date), by = "1 day")
> group <- data.frame(date=group)
> tibble(group)
# A tibble: 69 x 1
   date      
   <date>    
 1 2021-11-17
 2 2021-11-18
 3 2021-11-19
 4 2021-11-20
 5 2021-11-21
 6 2021-11-22
 7 2021-11-23
 8 2021-11-24
 9 2021-11-25
10 2021-11-26
# ... with 59 more rows

As you can see, the previous query was overwritten. So I tried creating a new df for the interpolated dates and tried merging them together. I got the error:

> library(stringr)
> group$combined <-  str_c(group$date, '', dateinterp$date)
Error: Assigned data `str_c(group$date, "", dateinterp$date)` must be compatible with existing data.
x Existing data has 40 rows.
x Assigned data has 69 rows.
i Only vectors of size 1 are recycled.

How can I insert these two matrices of differing length into the dataframe in chronological order without overwriting original data or conflicting? Following that, I'm not sure how I would proceed to interpolate the depths and temperatures for each date. Perhaps starting with something like the following:

depth = seq(1, 200, length.out = 100))

Eventually the date variable will be exchanged for geo coords. Any advice greatly appreciated.

EDIT: As requested by @AndreaM, an example of my data:

> dput(head(df))
structure(list(date = structure(c(18948, 18948, 18948, 18948, 
18948, 18948), class = "Date"), time = structure(c(0.326597222222222, 
0.326643518518519, 0.326655092592593, 0.326666666666667, 0.326678240740741, 
0.326712962962963), format = "h:m:s", class = "times"), depth = c(0.5, 
0.5, 1.1, 1.5, 2, 2.5), SV = c(1524.024, 1524.026, 1524.025, 
1524.008, 1524.016, 1524.084), temp = c(19.697, 19.864, 19.852, 
19.854, 19.856, 19.847), salinity = c(37.823, 37.561, 37.557, 
37.568, 37.573, 37.704), conduct = c(51.012, 50.878, 50.86, 50.876, 
50.884, 51.032), density = c(27, 26.755, 26.758, 26.768, 26.773, 
26.877), calcSV = c(1523.811, 1523.978, 1523.949, 1523.975, 1523.993, 
1524.124)), row.names = 100838:100843, class = "data.frame")

CodePudding user response:

one approach, adapt to your case as appropriate:

library(dplyr)
library(lubridate) ## facilitates date-time manipulations

## example data:
patchy_data <- data.frame(date = as.Date('2021-11-01')   sample(1:10, 6),
                          value = rnorm(12)) %>%
    arrange(date)

## create vector of -only!- missing dates:
missing_dates <- 
    setdiff(
        seq.Date(from = min(patchy_data$date),
                 to = max(patchy_data$date),
                 by = '1 day'
                 ),
        patchy_data$date
    ) %>% as.Date(origin = '1970-01-01')

## extend initial dataframe with rows per missing date:
full_data <-
    patchy_data %>%
        bind_rows(data.frame(date = missing_dates,
                             value = NA)
                  ) %>%
        arrange(date)

## group by month and impute missing data from monthwise statistic:
full_data %>%
    mutate(month = lubridate::month(date)) %>%
    group_by(month) %>%
    ## coalesce conveniently replaces ifelse-constructs to replace NAs
    mutate(imputed = coalesce(.$value, mean(.$value, na.rm = TRUE)))

CodePudding user response:

To me it is not clear the output but let's start with this. If you like the tidy approach, you can manage it this way all the columns in one dplyr chain, after creating the full dates range:

  # first all the dates
 dates <- data.frame(date = seq(min(as.Date(df$date)),max(as.Date(df$date)), by =1))


df %>%
    # convert date as.date
    mutate(date = as.Date(date)) %>%
    # join the full calendar
    right_join(dates, on = 'date') %>%
    # from wide to long format
    pivot_longer(!c(date,time),names_to ='variables') %>%
    # group to
    group_by(variables) %>%
    # calculate the max by date
    mutate(max_value = max(value, na.rm = T)) %>%
    # replace nas with the max
    mutate(value_filled = ifelse(is.na(value), max_value, value)) %>%
    # remove useless columns
    select(-c(max_value,value)) %>%
    # put in a long
    pivot_wider(names_from =variables, values_from = value_filled)

With result:

Joining, by = "date"
# A tibble: 6 x 9
  date       time     depthR    SV  temp salinity conduct density calcSV
  <date>     <chr>     <dbl> <dbl> <dbl>    <dbl>   <dbl>   <dbl>  <dbl>
1 2021-11-17 07:50:18    0.5  1524  19.7     37.8    51      27     1524
2 2021-11-17 07:50:22    0.5  1524  19.9     37.6    50.9    26.8   1524
3 2021-11-17 07:50:23    1.1  1524  19.9     37.6    50.9    26.8   1524
4 2021-11-17 07:50:24    1.5  1524  19.9     37.6    50.9    26.8   1524
5 2021-11-19 07:50:25    2    1524  19.9     37.6    50.9    26.8   1524
6 2021-11-18 NA          2    1524  19.9     37.8    51      27     1524

Obviously I've modded the data you give us, adding a different day (last row):

df <- read.table(text = "
   date       time     depthR    SV  temp salinity conduct density calcSV
 1 2021-11-17 07:50:18   0.5 1524.  19.7     37.8    51.0    27    1524.
 2 2021-11-17 07:50:22   0.5 1524.  19.9     37.6    50.9    26.8  1524.
 3 2021-11-17 07:50:23   1.1 1524.  19.9     37.6    50.9    26.8  1524.
 4 2021-11-17 07:50:24   1.5 1524.  19.9     37.6    50.9    26.8  1524.
 5 2021-11-19 07:50:25   2   1524.  19.9     37.6    50.9    26.8  1524.", header = T)
  • Related