Home > Net >  Substitute strings based on a match between two df
Substitute strings based on a match between two df

Time:10-04

I have two data.frame:

df1 <- data.frame(species = c("aa", "aa", "aa", "ab", "ab",
                               "ac", "ab", "aa", "ba", "bb"))

df2 <- data.frame(original.search = c("aa", "aa", "aa", "ab", "ab",
                               "ac", "ab", "aa", "ba", "bb"),
                  search.str = c("aa2", "aa2", "aa2", "ab2", "ab2",
                               "ac2", "ab2", "aa2", "ba2", "bb2"))

(Just for a reference, string values in each name vector are more like "Ocotea porosa", "Alchornea triplinervia" — I just used simple names to better exemplify)

What I'm trying to do is, for each match between species from df1 and original.search from df2, substitute the value of species in df1 by the search.str value in df2.

In the example, both data.frame are the same size and the columns are already matched, but my real df1 has close to 100,000 rows and my df2 are the unique values found in df1$species

How to proceed?

EDIT: If no matches exist between species and original.search, then I need to keep the value from species in the cell.

CodePudding user response:

Here is an option - create a named vector from the unique rows of 'df2', and use that to match and replace the 'species' column

df1$species <- with(unique(df2), setNames(search.str, 
          original.search))[df1$species]

Regarding the non matches cases, we may use coalesce after doing the same as above i.e. deframe (from tibble creates a named vector from the 'df2' uniqued, and then does the match and replace with original 'species' column and coalesce with species so that if there are any NAs in the replaced column for a row, it gets replaced with the corresponding 'species' value

library(dplyr)
library(tibble)
df1 %>% 
  mutate(species = coalesce(deframe(unique(df2))[species], species))
  • Related