Home > Mobile >  Merging two dataframes based on conditions in multiple columns
Merging two dataframes based on conditions in multiple columns

Time:12-16

I am trying to create a new df, call it df3, out of two other datasets:

df1 = data.frame("String" = c("a", "b", "c"), "Title" = c("A", "B", "C"), "Date" = c("2020-01-01", "2020-01-02", "2020-01-03"))

and:

df2 = data.frame("String" = c("a", "x", "y"), "Title" = c("ABCDEF", "XYZ", "YZ"), "Date" = c("2020-01-03", "2020-01-20", "2020-01-30"))

The conditions for the observations that should be matched, and form a new dataset, are: df1$String %$in% df2$String grepl(df1$Title, df2$Title) == TRUE df1$Date < df$Date

What is the best way to do this kind of merging? I have tried to create an indicator along the lines of :

df1$indicator = ifelse(df1$String %in% df2$String & grepl(df1$Title, df2$Title) & df1$Date < df$Date, 1, 0)

or

df1$indicator = ifelse(df1$String %in% df2$String & grepl(df1$Title, df2$Title[df1$String %in% df2$String) & df1$Date < df2$Date[df1$String %in% df2$String, 1, 0)

to then use for merging, but I've been getting "longer object length is not a multiple of shorter object length" and "argument 'pattern' has length > 1 and only the first element will be used" warnings.

CodePudding user response:

One way: Use a crossjoin then filter the result. Note that grepl is not vectorized over both arguments, so i use mapply.

df1 = data.frame("String" = c("a", "b", "c"), "Title" = c("A", "B", "C"), "Date" = c("2020-01-01", "2020-01-02", "2020-01-03"))
df2 = data.frame("String" = c("a", "x", "y"), "Title" = c("ABCDEF", "XYZ", "YZ"), "Date" = c("2020-01-03", "2020-01-20", "2020-01-30"))


merge(df1,df2, by=NULL, suffixes = c(".x", ".y")) |> 
  subset(String.x %in% String.y 
         & mapply(grepl, Title.x, Title.y) 
         & Date.x < Date.y )
#>   String.x Title.x     Date.x String.y Title.y     Date.y
#> 1        a       A 2020-01-01        a  ABCDEF 2020-01-03
  • Related