Home > Enterprise >  Merging datasets leads to columns data transforming into NA
Merging datasets leads to columns data transforming into NA

Time:07-22

I am trying to merge two datasets based on dates

However, the merge function leads to the "duration_days" part to be transformed into NA

Any way to fix that ?

> dput(head(ksM))
structure(list(date = c("01/01/2021", "01/05/2021", "01/06/2021", 
"01/07/2021", "01/11/2021", "01/12/2021"), `projects launched` = c(4L, 
8L, 4L, 4L, 4L, 12L), pledged = c(50278.64, 366279.590415302, 
172073.0471292, 230.537553792, 304353.5676352, 285277.861423738
), backers = c(2880L, 6588L, 3528L, 16L, 4204L, 6632L), `total goal` = c(24000, 
148000, 60000, 4000, 140000, 198000), `mean goal` = c(6000, 18500, 
15000, 1000, 35000, 16500), US = c(4L, 4L, NA, NA, NA, 8L), `number of success` = c(4L, 
8L, 4L, 4L, 4L, 12L)), row.names = c(NA, 6L), class = "data.frame")

> dput(head(group_duration))
structure(list(date = structure(c(18628, 18632, 18633, 18634, 
18638, 18639), class = "Date"), duration_days = c(30, 31, 30, 
53, 30, 42.6666666666667)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

test <- merge(ksM, group_duration, by.x = "date", by.y = "date", all.x= TRUE )

> dput(head(test))
structure(list(date = c("01/01/2021", "01/05/2021", "01/06/2021", 
"01/07/2021", "01/11/2021", "01/12/2021"), `projects launched` = c(4L, 
8L, 4L, 4L, 4L, 12L), pledged = c(50278.64, 366279.590415302, 
172073.0471292, 230.537553792, 304353.5676352, 285277.861423738
), backers = c(2880L, 6588L, 3528L, 16L, 4204L, 6632L), `total goal` = c(24000, 
148000, 60000, 4000, 140000, 198000), `mean goal` = c(6000, 18500, 
15000, 1000, 35000, 16500), US = c(4L, 4L, NA, NA, NA, 8L), `number of success` = c(4L, 
8L, 4L, 4L, 4L, 12L), duration_days = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_)), row.names = c(NA, 6L), class = "data.frame")

CodePudding user response:

Your date column in ksM is a character vector. You need to convert it to date:

ksM$date  <- as.Date(ksM$date, format = "%m/%d/%Y")

merge(ksM, group_duration, by = "date", all.x= TRUE )
#         date projects launched     pledged backers total goal mean goal US number of success duration_days
# 1 2021-01-01                 4  50278.6400    2880      24000      6000  4                 4      30.00000
# 2 2021-01-05                 8 366279.5904    6588     148000     18500  4                 8      31.00000
# 3 2021-01-06                 4 172073.0471    3528      60000     15000 NA                 4      30.00000
# 4 2021-01-07                 4    230.5376      16       4000      1000 NA                 4      53.00000
# 5 2021-01-11                 4 304353.5676    4204     140000     35000 NA                 4      30.00000
# 6 2021-01-12                12 285277.8614    6632     198000     16500  8                12      42.66667

CodePudding user response:

Your days columns have different format. You can change that by using as.Date with the right format and join them using a left_join like this:

ksM <- structure(list(date = c("01/01/2021", "01/05/2021", "01/06/2021", 
                               "01/07/2021", "01/11/2021", "01/12/2021"), `projects launched` = c(4L, 
                                                                                                  8L, 4L, 4L, 4L, 12L), pledged = c(50278.64, 366279.590415302, 
                                                                                                                                    172073.0471292, 230.537553792, 304353.5676352, 285277.861423738
                                                                                                  ), backers = c(2880L, 6588L, 3528L, 16L, 4204L, 6632L), `total goal` = c(24000, 
                                                                                                                                                                           148000, 60000, 4000, 140000, 198000), `mean goal` = c(6000, 18500, 
                                                                                                                                                                                                                                 15000, 1000, 35000, 16500), US = c(4L, 4L, NA, NA, NA, 8L), `number of success` = c(4L, 
                                                                                                                                                                                                                                                                                                                     8L, 4L, 4L, 4L, 12L)), row.names = c(NA, 6L), class = "data.frame")
group_duration <- structure(list(date = structure(c(18628, 18632, 18633, 18634, 
                                                    18638, 18639), class = "Date"), duration_days = c(30, 31, 30, 
                                                                                                      53, 30, 42.6666666666667)), row.names = c(NA, -6L), class = c("tbl_df", 
                                                                                                                                                                    "tbl", "data.frame"))                    

library(dplyr)
ksM %>% 
  mutate(date = as.Date(date, "%m/%d/%Y")) %>% 
  left_join(group_duration, by = "date")
#>         date projects launched     pledged backers total goal mean goal US
#> 1 2021-01-01                 4  50278.6400    2880      24000      6000  4
#> 2 2021-01-05                 8 366279.5904    6588     148000     18500  4
#> 3 2021-01-06                 4 172073.0471    3528      60000     15000 NA
#> 4 2021-01-07                 4    230.5376      16       4000      1000 NA
#> 5 2021-01-11                 4 304353.5676    4204     140000     35000 NA
#> 6 2021-01-12                12 285277.8614    6632     198000     16500  8
#>   number of success duration_days
#> 1                 4      30.00000
#> 2                 8      31.00000
#> 3                 4      30.00000
#> 4                 4      53.00000
#> 5                 4      30.00000
#> 6                12      42.66667

Created on 2022-07-22 by the reprex package (v2.0.1)

  •  Tags:  
  • r
  • Related