In R, how do I combine two data frames when multiple columns need to match?

I have two data frames.

dfA is the large one. It includes all cities and all health data for 2020-2021, except for the 2021 health data of CityC, which is marked as -3.

dfA

City Name Pulse20 Pulse21
CityA Amy 77 78
CityB Bob 80 79
CityC Cathy 79 -3

dfB is the small one. It contains the data I want to fill into dfA.

dfB

City Name Pulse21
CityC Cathy 80

Questions:

1. What is the usual way to combine these two data frames?

2. If I use full_join(), Pulse21 is split into Pulse21.x and Pulse21.y, so I need extra steps to merge the two columns back together.

3. For the record, in my real data each city has more than 500 people, and there are 100 or more health variables.

So is there anything I can do to make this simpler and more efficient? Thanks a lot!

CodePudding user response:

dplyr::rows_update(dfA, dfB, by = c("City", "Name"))

   City  Name Pulse20 Pulse21
1 CityA   Amy      77      78
2 CityB   Bob      80      79
3 CityC Cathy      79      80
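Regarding the point about having 100 or more health variables: rows_update() overwrites every non-key column that is present in the second data frame, so the same call should carry over to wider data. Here is a minimal sketch of that, where Temp21 is a hypothetical extra column invented purely for illustration (it is not in the original data):

```r
library(dplyr)

# Hypothetical data: Temp21 is an invented column, used only to show
# that several columns are updated in one call.
dfA <- tibble::tribble(
  ~City,   ~Name,   ~Pulse20, ~Pulse21, ~Temp21,
  "CityA", "Amy",   77,       78,       36.5,
  "CityB", "Bob",   80,       79,       36.8,
  "CityC", "Cathy", 79,       -3,       -3
)

dfB <- tibble::tribble(
  ~City,   ~Name,   ~Pulse21, ~Temp21,
  "CityC", "Cathy", 80,       36.6
)

# All non-key columns of dfB replace the values in the matching rows of dfA.
# By default the call fails if dfB contains a City/Name pair that does not
# exist in dfA, which acts as a useful sanity check.
dfA_filled <- rows_update(dfA, dfB, by = c("City", "Name"))
dfA_filled
```

Columns that only exist in dfA (here Pulse20) are left untouched.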

CodePudding user response:

I would still use full_join(), but with only one suffix, and then use mutate() to fix the problem.

Here is an example:

library(dplyr)

# Join with full_join, but only add suffix to one column
df_joined <- 
  dfA %>% 
  full_join(dfB, by = c("City", "Name"), suffix = c("", "_replacement"))

df_joined
#> # A tibble: 3 × 5
#>   City  Name  Pulse20 Pulse21 Pulse21_replacement
#>   <chr> <chr>   <dbl>   <dbl>               <dbl>
#> 1 CityA Amy        77      78                  NA
#> 2 CityB Bob        80      79                  NA
#> 3 CityC Cathy      79      -3                  80

# Fix issue with CityC
df_joined %>% 
  mutate(
    Pulse21 = ifelse(is.na(Pulse21_replacement), Pulse21, Pulse21_replacement)
  ) %>% 
  select(-Pulse21_replacement)
#> # A tibble: 3 × 4
#>   City  Name  Pulse20 Pulse21
#>   <chr> <chr>   <dbl>   <dbl>
#> 1 CityA Amy        77      78
#> 2 CityB Bob        80      79
#> 3 CityC Cathy      79      80

Created on 2022-04-17 by the reprex package (v2.0.1)
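If there are many columns to fix, writing one mutate() line per column gets tedious. A possible sketch of the same join-then-replace idea applied to every "_replacement" column in a loop is shown here. Temp21 is a hypothetical second column added only for illustration, and coalesce() is used in place of ifelse() to take the replacement value wherever one exists:

```r
library(dplyr)

# Hypothetical data: Temp21 is an invented column for illustration.
dfA <- tibble::tribble(
  ~City,   ~Name,   ~Pulse20, ~Pulse21, ~Temp21,
  "CityA", "Amy",   77,       78,       36.5,
  "CityB", "Bob",   80,       79,       36.8,
  "CityC", "Cathy", 79,       -3,       -3
)

dfB <- tibble::tribble(
  ~City,   ~Name,   ~Pulse21, ~Temp21,
  "CityC", "Cathy", 80,       36.6
)

joined <- full_join(dfA, dfB, by = c("City", "Name"),
                    suffix = c("", "_replacement"))

# Find every column that came from dfB and fold it into its original column.
repl_cols <- grep("_replacement$", names(joined), value = TRUE)
for (col in repl_cols) {
  target <- sub("_replacement$", "", col)
  # Use the replacement where it is not NA, otherwise keep the original value.
  joined[[target]] <- coalesce(joined[[col]], joined[[target]])
}

# Drop the helper columns once their values have been merged.
joined <- select(joined, -all_of(repl_cols))
joined
```

This assumes that no original column name already ends in "_replacement".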

Some other options

Other possible solutions include bind_rows() together with summarize(), or dplyr::rows_update() as suggested by @KU99. Note, however, that rows_update() was marked as experimental at the time of writing and might change with future updates of dplyr.
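For completeness, the bind_rows() plus summarize() idea could look roughly like the sketch below. It assumes that -3 is only ever used as a "missing" marker and never occurs as a real measurement, and that all health columns start with "Pulse" as in the example data:

```r
library(dplyr)

dfA <- tibble::tribble(
  ~City,   ~Name,   ~Pulse20, ~Pulse21,
  "CityA", "Amy",   77,       78,
  "CityB", "Bob",   80,       79,
  "CityC", "Cathy", 79,       -3
)

dfB <- tibble::tribble(
  ~City,   ~Name,   ~Pulse21,
  "CityC", "Cathy", 80
)

df_combined <-
  bind_rows(dfA, dfB) %>%
  # Assumption: -3 always means "missing", so turn it into NA first
  mutate(across(starts_with("Pulse"), ~ na_if(.x, -3))) %>%
  group_by(City, Name) %>%
  # Keep the first non-missing value per person and column
  summarize(across(everything(), ~ .x[!is.na(.x)][1]), .groups = "drop")

df_combined
```

Stacking the rows means the columns missing from dfB (here Pulse20) are filled with NA, and the summarize() step then collapses each City/Name group back to a single row.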

Data

dfA <- tibble::tribble(
  ~City,   ~Name,   ~Pulse20, ~Pulse21,
  "CityA", "Amy",   77,       78,
  "CityB", "Bob",   80,       79,
  "CityC", "Cathy", 79,       -3
)

dfB <- tibble::tribble(
  ~City,   ~Name,   ~Pulse21,
  "CityC", "Cathy", 80
)