I have two data frames containing row entries with respective dates. Data frame 1 contains observations collected from 2010 to 2017.
dates A
2010-01-01 21
2010-01-02 27
2010-01-03 34
...
2017-12-29 22
2017-12-30 32
2017-12-31 25
Data frame 2 contains observations collected from 2015 to 2020.
dates A
2015-01-01 20
2015-01-02 29
2015-01-03 34
...
2020-12-29 22
2020-12-30 27
2020-12-31 32
Both data frames have missing observations for some days. I wish to combine them to fill in the missing data and obtain a complete time series up to 2020 without any repeated entries, like the following data frame:
dates A
2010-01-01 21
2010-01-02 27
2010-01-03 34
...
2020-12-29 22
2020-12-30 27
2020-12-31 32
Using merge(df1, df2, by = 'dates') or full_join(df1, df2, by = 'dates') creates duplicate entries or two columns, A.x and A.y, which is not the expected result.
CodePudding user response:
If your df is really just two columns, you should be able to use bind_rows and distinct to remove duplicates.
library(dplyr)
# Stack both data frames, then keep one row per unique (dates, A) pair
df <- bind_rows(df1, df2) %>%
  distinct(dates, A)
Edit: This will not work if the two data frames disagree on a date. If you have two records for 2015-01-01 and they have different A values, both will be retained.
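To check whether that caveat applies to your data, a base R sketch along these lines could list the dates on which the two data frames disagree. The small df1 and df2 below are made-up stand-ins for illustration, not the real data, and conflicts is a hypothetical name.

```r
# Toy stand-ins for df1 and df2 (values are illustrative only).
df1 <- data.frame(dates = as.Date(c("2015-01-01", "2015-01-02")), A = c(20, 29))
df2 <- data.frame(dates = as.Date(c("2015-01-02", "2015-01-03")), A = c(30, 34))

# Stack the frames and drop rows that are identical in both columns.
combined <- unique(rbind(df1, df2))

# Any date that still appears more than once has conflicting A values.
dup_dates <- unique(combined$dates[duplicated(combined$dates)])
conflicts <- combined[combined$dates %in% dup_dates, ]
conflicts <- conflicts[order(conflicts$dates), ]
print(conflicts)
```

If conflicts comes back empty, removing exact duplicates is enough; otherwise you need to decide which value to keep for those dates.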
CodePudding user response:
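Try the code below:
# Stack both data frames and drop rows that are exact duplicates
dfout <- unique(rbind(df1,df2))
# Sort the result chronologically
dfout <- dfout[order(dfout$dates),]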
CodePudding user response:
Combine df1 and df2. If a date is present in both data frames, sum the A values, then use complete to fill in the missing dates. Note that summing doubles A on dates where both data frames hold the same observation.
library(dplyr)
library(tidyr)
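df1 %>%
  bind_rows(df2) %>%
  mutate(dates = as.Date(dates)) %>%
  group_by(dates) %>%
  # One row per date; A values from both data frames are added together
  summarise(A = sum(A)) %>%
  # Insert a row (with A = NA) for every calendar day missing from the range
  complete(dates = seq(min(dates), max(dates), by = 'day'))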
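If summing is not what you want for overlapping dates, one possible variant in base R is sketched below. It assumes (my assumption, not something stated in the question) that the value from df1 should win when a date appears in both data frames. The toy data and the names all_days and filled are hypothetical.

```r
# Toy stand-ins for df1 and df2 (values are illustrative only).
df1 <- data.frame(dates = as.Date(c("2015-01-01", "2015-01-02")), A = c(20, 29))
df2 <- data.frame(dates = as.Date(c("2015-01-02", "2015-01-05")), A = c(29, 34))

combined <- rbind(df1, df2)
# Keep the first row seen for each date, so df1 wins on overlap (an assumption).
combined <- combined[!duplicated(combined$dates), ]

# Build every calendar day in the covered range and left-join the data onto it.
all_days <- data.frame(
  dates = seq(min(combined$dates), max(combined$dates), by = "day")
)
filled <- merge(all_days, combined, by = "dates", all.x = TRUE)
print(filled)
```

Days with no observation in either data frame end up with A = NA, so the gaps stay visible instead of being silently dropped.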