I am working with a large time series of oceanographic data which needs a lot of manipulation. I have several days of data missing and would like to interpolate them. Specifically date/depth/temperature. Here is an example of my df:
> tibble(df)
# A tibble: 351,685 x 9
date time depthR SV temp salinity conduct density calcSV
<date> <times> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021-11-17 07:50:18 0.5 1524. 19.7 37.8 51.0 27 1524.
2 2021-11-17 07:50:22 0.5 1524. 19.9 37.6 50.9 26.8 1524.
3 2021-11-17 07:50:23 1.1 1524. 19.9 37.6 50.9 26.8 1524.
4 2021-11-17 07:50:24 1.5 1524. 19.9 37.6 50.9 26.8 1524.
5 2021-11-17 07:50:25 2 1524. 19.9 37.6 50.9 26.8 1524.
Each date contains over 1000 lines of data and so my idea was to find the max depth of each day to therefore interpolate reasonable max depth values for the missing days between. So far, I have found the max depth per date:
group <- df %>% group_by(date) %>% summarise(max =max(depthR, na.rm=TRUE))
> tibble(group)
# A tibble: 40 x 2
date max
<date> <dbl>
1 2021-11-17 685.
2 2021-11-18 695.
3 2021-11-19 136.
4 2021-11-20 138.
5 2021-11-21 142.
6 2021-11-22 26
7 2021-11-23 136.
8 2021-11-24 297.
9 2021-11-25 613.
10 2021-11-26 81.1
# ... with 30 more rows
And then I managed to interpolate the missing dates by:
> group <- seq(min(group$date), max(group$date), by = "1 day")
> group <- data.frame(date=group)
> tibble(group)
# A tibble: 69 x 1
date
<date>
1 2021-11-17
2 2021-11-18
3 2021-11-19
4 2021-11-20
5 2021-11-21
6 2021-11-22
7 2021-11-23
8 2021-11-24
9 2021-11-25
10 2021-11-26
# ... with 59 more rows
As you can see, the previous query was overwritten. So I tried creating a new df for the interpolated dates and tried merging them together. I got the error:
> library(stringr)
> group$combined <- str_c(group$date, '', dateinterp$date)
Error: Assigned data `str_c(group$date, "", dateinterp$date)` must be compatible with existing data.
x Existing data has 40 rows.
x Assigned data has 69 rows.
i Only vectors of size 1 are recycled.
How can I insert these two matrices of differing length into the dataframe in chronological order without overwriting original data or conflicting? Following that, I'm not sure how I would proceed to interpolate the depths and temperatures for each date. Perhaps starting with something like the following:
depth = seq(1, 200, length.out = 100))
Eventually the date variable will be exchanged for geo coords. Any advice greatly appreciated.
EDIT: As requested by @AndreaM, an example of my data:
> dput(head(df))
structure(list(date = structure(c(18948, 18948, 18948, 18948,
18948, 18948), class = "Date"), time = structure(c(0.326597222222222,
0.326643518518519, 0.326655092592593, 0.326666666666667, 0.326678240740741,
0.326712962962963), format = "h:m:s", class = "times"), depth = c(0.5,
0.5, 1.1, 1.5, 2, 2.5), SV = c(1524.024, 1524.026, 1524.025,
1524.008, 1524.016, 1524.084), temp = c(19.697, 19.864, 19.852,
19.854, 19.856, 19.847), salinity = c(37.823, 37.561, 37.557,
37.568, 37.573, 37.704), conduct = c(51.012, 50.878, 50.86, 50.876,
50.884, 51.032), density = c(27, 26.755, 26.758, 26.768, 26.773,
26.877), calcSV = c(1523.811, 1523.978, 1523.949, 1523.975, 1523.993,
1524.124)), row.names = 100838:100843, class = "data.frame")
CodePudding user response:
one approach, adapt to your case as appropriate:
library(dplyr)
library(lubridate) ## facilitates date-time manipulations
## example data:
patchy_data <- data.frame(date = as.Date('2021-11-01') sample(1:10, 6),
value = rnorm(12)) %>%
arrange(date)
## create vector of -only!- missing dates:
missing_dates <-
setdiff(
seq.Date(from = min(patchy_data$date),
to = max(patchy_data$date),
by = '1 day'
),
patchy_data$date
) %>% as.Date(origin = '1970-01-01')
## extend initial dataframe with rows per missing date:
full_data <-
patchy_data %>%
bind_rows(data.frame(date = missing_dates,
value = NA)
) %>%
arrange(date)
## group by month and impute missing data from monthwise statistic:
full_data %>%
mutate(month = lubridate::month(date)) %>%
group_by(month) %>%
## coalesce conveniently replaces ifelse-constructs to replace NAs
mutate(imputed = coalesce(.$value, mean(.$value, na.rm = TRUE)))
CodePudding user response:
To me it is not clear the output but let's start with this.
If you like the tidy approach, you can manage it this way all the columns in one dplyr
chain, after creating the full dates range:
# first all the dates
dates <- data.frame(date = seq(min(as.Date(df$date)),max(as.Date(df$date)), by =1))
df %>%
# convert date as.date
mutate(date = as.Date(date)) %>%
# join the full calendar
right_join(dates, on = 'date') %>%
# from wide to long format
pivot_longer(!c(date,time),names_to ='variables') %>%
# group to
group_by(variables) %>%
# calculate the max by date
mutate(max_value = max(value, na.rm = T)) %>%
# replace nas with the max
mutate(value_filled = ifelse(is.na(value), max_value, value)) %>%
# remove useless columns
select(-c(max_value,value)) %>%
# put in a long
pivot_wider(names_from =variables, values_from = value_filled)
With result:
Joining, by = "date"
# A tibble: 6 x 9
date time depthR SV temp salinity conduct density calcSV
<date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021-11-17 07:50:18 0.5 1524 19.7 37.8 51 27 1524
2 2021-11-17 07:50:22 0.5 1524 19.9 37.6 50.9 26.8 1524
3 2021-11-17 07:50:23 1.1 1524 19.9 37.6 50.9 26.8 1524
4 2021-11-17 07:50:24 1.5 1524 19.9 37.6 50.9 26.8 1524
5 2021-11-19 07:50:25 2 1524 19.9 37.6 50.9 26.8 1524
6 2021-11-18 NA 2 1524 19.9 37.8 51 27 1524
Obviously I've modded the data you give us, adding a different day (last row):
df <- read.table(text = "
date time depthR SV temp salinity conduct density calcSV
1 2021-11-17 07:50:18 0.5 1524. 19.7 37.8 51.0 27 1524.
2 2021-11-17 07:50:22 0.5 1524. 19.9 37.6 50.9 26.8 1524.
3 2021-11-17 07:50:23 1.1 1524. 19.9 37.6 50.9 26.8 1524.
4 2021-11-17 07:50:24 1.5 1524. 19.9 37.6 50.9 26.8 1524.
5 2021-11-19 07:50:25 2 1524. 19.9 37.6 50.9 26.8 1524.", header = T)