Merging Dataset without loosing values-CodePudding

I am trying to merge two datasets together based on dates

My objective is to get the data for each date (including those where some data will be missing), so under the format 01/01/2021 02/01/2021 ...

However, using the Merge function leads to most of the data becoming NA

jointdataset <- merge(group_df, group_tweet, by = 'date', all.x= TRUE)

structure(list(date = c("01/01/2021", "01/05/2021", "01/06/2021", 
"01/07/2021", "01/11/2021", "01/12/2021"), `length(category)` = c(4L, 
8L, 4L, 4L, 4L, 12L), `sum(usd_pledged)` = c(50278.64, 366279.590415302, 
172073.0471292, 230.537553792, 304353.5676352, 285277.861423738
), `sum(backers_count)` = c(2880L, 6588L, 3528L, 16L, 4204L, 
6632L), `length(text)` = c(NA_integer_, NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_), `sum(public_metrics.x$reply_count)` = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), `sum(public_metrics.x$like_count)` = c(NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_), `sum(public_metrics.x$retweet_count)` = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
)), row.names = c(NA, 6L), class = "data.frame")

Here are samples of both of my datasets

dput(head(group_df))
structure(list(date = c("01/01/2021", "01/05/2021", "01/06/2021", 
"01/07/2021", "01/11/2021", "01/12/2021"), `length(category)` = c(4L, 
8L, 4L, 4L, 4L, 12L), `sum(usd_pledged)` = c(50278.64, 366279.590415302, 
172073.0471292, 230.537553792, 304353.5676352, 285277.861423738
), `sum(backers_count)` = c(2880L, 6588L, 3528L, 16L, 4204L, 
6632L)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))

dput(head(group_tweet))
structure(list(date = structure(c(18628, 18629, 18630, 18631, 
18632, 18633), class = "Date"), `length(text)` = c(1324L, 1548L, 
1297L, 1585L, 1636L, 1583L), `sum(public_metrics.x$reply_count)` = c(882L, 
1252L, 910L, 1018L, 810L, 1000L), `sum(public_metrics.x$like_count)` = c(22859L, 
24375L, 17854L, 20341L, 19521L, 19401L), `sum(public_metrics.x$retweet_count)` = c(8621L, 
8239L, 6141L, 6728L, 6938L, 6842L)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

Is there any way for this data to not become NA ?

CodePudding user response：

You could use full_join from dplyr. Your dates columns should have the same type (Thanks to @TarJae!). You can use the following code:

library(dplyr)
library(lubridate)
group_df %>% 
  mutate(date = mdy(date)) %>%
  full_join(group_tweet) %>% 
  arrange(date)
#> Joining, by = "date"
#> # A tibble: 9 × 8
#>   date       `length(category)` `sum(usd_pledg…` `sum(backers_c…` `length(text)`
#>   <date>                  <int>            <dbl>            <int>          <int>
#> 1 2021-01-01                  4           50279.             2880           1324
#> 2 2021-01-02                 NA              NA                NA           1548
#> 3 2021-01-03                 NA              NA                NA           1297
#> 4 2021-01-04                 NA              NA                NA           1585
#> 5 2021-01-05                  8          366280.             6588           1636
#> 6 2021-01-06                  4          172073.             3528           1583
#> 7 2021-01-07                  4             231.               16             NA
#> 8 2021-01-11                  4          304354.             4204             NA
#> 9 2021-01-12                 12          285278.             6632             NA
#> # … with 3 more variables: `sum(public_metrics.x$reply_count)` <int>,
#> #   `sum(public_metrics.x$like_count)` <int>,
#> #   `sum(public_metrics.x$retweet_count)` <int>

^{Created on 2022-07-09 by the reprex package (v2.0.1)}

CodePudding user response：

Thank you for all your help

Here is how i managed to fix it: Using the code given to me by @Quinten and @TarJae, I mutated the date by itself:

group_df = group_df %>% 
  mutate(date = mdy(date))

And afterwards i changed the date in group df to a character

group_df$date <- as.character(group_df$date)

Following this, a left join was enough

jointdataset <- dplyr::left_join(group_df, group_tweet)

> dput(head(jointdataset))
structure(list(date = c("2021-01-01", "2021-01-05", "2021-01-06", 
"2021-01-07", "2021-01-11", "2021-01-12"), Kickstarter = c(4L, 
8L, 4L, 4L, 4L, 12L), Money = c(50278.64, 366279.590415302, 172073.0471292, 
230.537553792, 304353.5676352, 285277.861423738), Backers = c(2880L, 
6588L, 3528L, 16L, 4204L, 6632L), Twitter = c(1324L, 1636L, 1583L, 
1492L, 1718L, 1781L), replies = c(882L, 810L, 1000L, 1050L, 1376L, 
1387L), likes = c(22859L, 19521L, 19401L, 19875L, 24307L, 28284L
), retweets = c(8621L, 6938L, 6842L, 6875L, 7807L, 10235L)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

Only issue i have now is that the new dataset will not show me data for dates where there is data for only one of the datasets but that should be an easy fix