I am trying to create a new df, call it df3, out of two other datasets:
df1 = data.frame("String" = c("a", "b", "c"), "Title" = c("A", "B", "C"), "Date" = c("2020-01-01", "2020-01-02", "2020-01-03"))
and:
df2 = data.frame("String" = c("a", "x", "y"), "Title" = c("ABCDEF", "XYZ", "YZ"), "Date" = c("2020-01-03", "2020-01-20", "2020-01-30"))
The conditions for the observations that should be matched, and form a new dataset, are: df1$String %$in% df2$String grepl(df1$Title, df2$Title) == TRUE df1$Date < df$Date
What is the best way to do this kind of merging? I have tried to create an indicator along the lines of :
df1$indicator = ifelse(df1$String %in% df2$String & grepl(df1$Title, df2$Title) & df1$Date < df$Date, 1, 0)
or
df1$indicator = ifelse(df1$String %in% df2$String & grepl(df1$Title, df2$Title[df1$String %in% df2$String) & df1$Date < df2$Date[df1$String %in% df2$String, 1, 0)
to then use for merging, but I've been getting "longer object length is not a multiple of shorter object length" and "argument 'pattern' has length > 1 and only the first element will be used" warnings.
CodePudding user response:
One way: Use a crossjoin then filter the result. Note that grepl is not vectorized over both arguments, so i use mapply.
df1 = data.frame("String" = c("a", "b", "c"), "Title" = c("A", "B", "C"), "Date" = c("2020-01-01", "2020-01-02", "2020-01-03"))
df2 = data.frame("String" = c("a", "x", "y"), "Title" = c("ABCDEF", "XYZ", "YZ"), "Date" = c("2020-01-03", "2020-01-20", "2020-01-30"))
merge(df1,df2, by=NULL, suffixes = c(".x", ".y")) |>
subset(String.x %in% String.y
& mapply(grepl, Title.x, Title.y)
& Date.x < Date.y )
#> String.x Title.x Date.x String.y Title.y Date.y
#> 1 a A 2020-01-01 a ABCDEF 2020-01-03