I am trying to merge two datasets together based on dates
My objective is to get the data for each date (including those where some data will be missing), so under the format 01/01/2021 02/01/2021 ...
However, using the Merge function leads to most of the data becoming NA
jointdataset <- merge(group_df, group_tweet, by = 'date', all.x= TRUE)
structure(list(date = c("01/01/2021", "01/05/2021", "01/06/2021",
"01/07/2021", "01/11/2021", "01/12/2021"), `length(category)` = c(4L,
8L, 4L, 4L, 4L, 12L), `sum(usd_pledged)` = c(50278.64, 366279.590415302,
172073.0471292, 230.537553792, 304353.5676352, 285277.861423738
), `sum(backers_count)` = c(2880L, 6588L, 3528L, 16L, 4204L,
6632L), `length(text)` = c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), `sum(public_metrics.x$reply_count)` = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), `sum(public_metrics.x$like_count)` = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), `sum(public_metrics.x$retweet_count)` = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
)), row.names = c(NA, 6L), class = "data.frame")
Here are samples of both of my datasets
dput(head(group_df))
structure(list(date = c("01/01/2021", "01/05/2021", "01/06/2021",
"01/07/2021", "01/11/2021", "01/12/2021"), `length(category)` = c(4L,
8L, 4L, 4L, 4L, 12L), `sum(usd_pledged)` = c(50278.64, 366279.590415302,
172073.0471292, 230.537553792, 304353.5676352, 285277.861423738
), `sum(backers_count)` = c(2880L, 6588L, 3528L, 16L, 4204L,
6632L)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
dput(head(group_tweet))
structure(list(date = structure(c(18628, 18629, 18630, 18631,
18632, 18633), class = "Date"), `length(text)` = c(1324L, 1548L,
1297L, 1585L, 1636L, 1583L), `sum(public_metrics.x$reply_count)` = c(882L,
1252L, 910L, 1018L, 810L, 1000L), `sum(public_metrics.x$like_count)` = c(22859L,
24375L, 17854L, 20341L, 19521L, 19401L), `sum(public_metrics.x$retweet_count)` = c(8621L,
8239L, 6141L, 6728L, 6938L, 6842L)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
Is there any way for this data to not become NA ?
CodePudding user response:
You could use full_join
from dplyr
. Your dates columns should have the same type (Thanks to @TarJae!). You can use the following code:
library(dplyr)
library(lubridate)
group_df %>%
mutate(date = mdy(date)) %>%
full_join(group_tweet) %>%
arrange(date)
#> Joining, by = "date"
#> # A tibble: 9 × 8
#> date `length(category)` `sum(usd_pledg…` `sum(backers_c…` `length(text)`
#> <date> <int> <dbl> <int> <int>
#> 1 2021-01-01 4 50279. 2880 1324
#> 2 2021-01-02 NA NA NA 1548
#> 3 2021-01-03 NA NA NA 1297
#> 4 2021-01-04 NA NA NA 1585
#> 5 2021-01-05 8 366280. 6588 1636
#> 6 2021-01-06 4 172073. 3528 1583
#> 7 2021-01-07 4 231. 16 NA
#> 8 2021-01-11 4 304354. 4204 NA
#> 9 2021-01-12 12 285278. 6632 NA
#> # … with 3 more variables: `sum(public_metrics.x$reply_count)` <int>,
#> # `sum(public_metrics.x$like_count)` <int>,
#> # `sum(public_metrics.x$retweet_count)` <int>
Created on 2022-07-09 by the reprex package (v2.0.1)
CodePudding user response:
Thank you for all your help
Here is how i managed to fix it: Using the code given to me by @Quinten and @TarJae, I mutated the date by itself:
group_df = group_df %>%
mutate(date = mdy(date))
And afterwards i changed the date in group df to a character
group_df$date <- as.character(group_df$date)
Following this, a left join was enough
jointdataset <- dplyr::left_join(group_df, group_tweet)
> dput(head(jointdataset))
structure(list(date = c("2021-01-01", "2021-01-05", "2021-01-06",
"2021-01-07", "2021-01-11", "2021-01-12"), Kickstarter = c(4L,
8L, 4L, 4L, 4L, 12L), Money = c(50278.64, 366279.590415302, 172073.0471292,
230.537553792, 304353.5676352, 285277.861423738), Backers = c(2880L,
6588L, 3528L, 16L, 4204L, 6632L), Twitter = c(1324L, 1636L, 1583L,
1492L, 1718L, 1781L), replies = c(882L, 810L, 1000L, 1050L, 1376L,
1387L), likes = c(22859L, 19521L, 19401L, 19875L, 24307L, 28284L
), retweets = c(8621L, 6938L, 6842L, 6875L, 7807L, 10235L)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Only issue i have now is that the new dataset will not show me data for dates where there is data for only one of the datasets but that should be an easy fix