R: conditionally mutate a variable when columns match in different dataframes-CodePudding

I am attempting to write some R code that assesses whether or not two dataframes have any matches in their columns. If there are matches, one of the columns in the second dataframe should assign a "link" (via the links variable) to the first dataframe using the id column of the first dataframe.

In the event that there are multiple matches, I am trying to get the "link" variable to randomly select one of the matching id's.

Some reproducible code:

library(dplyr)

df1 = data.frame(ids = c(1:5),
                 var = c("a","a","c","b","b"))

df2 = data.frame(var = c('c','a','b','b','d'),
                 links = 0)

Ideally, I would like a resulting dataframe that looks like:

  var    links
1   c        3
2   a   1 or 2
3   b   4 or 5
4   b   4 or 5
5   d        0

where observations in the links column randomly select ids from df1 when df1$var matches df2$var. In the dataframe above, this is denoted by "or".

Note 1: The links column should be a numeric, I only made it character to allow to write the word "or".

Note 2: If there is not a match between df1$var and df2$var, the links column should remain a 0.

So far, I've gone this route, but I'm unsure about what to put after the ~

linked_df = df2 %>%
     mutate(links=case_when(links==0 & var %in% df1$var ~ 
                            sample(c(df1$ids),n(),replace=T) # unsure about this line
                            TRUE ~ links)

CodePudding user response：

I think this is what you want. I've left the ids column in the result, but it can be removed when the sampling is complete.

library(dplyr)
library(tidyr)

df1_nest = df1 %>%
  group_by(var) %>% 
  summarize(ids = list(ids))

safe_sample = function(x, ...) {
  if(length(x) == 1) return(x)
  sample(x, ...)
}

set.seed(47)
df2 %>%
  left_join(df1_nest) %>%
  mutate(
    links = sapply(ids, \(x) if(is.null(x)) 0L else safe_sample(x, size = 1))
  )
# Joining, by = "var"
#   var links  ids
# 1   c     3    3
# 2   a     1 1, 2
# 3   b     4 4, 5
# 4   b     5 4, 5
# 5   d     0 NULL

CodePudding user response：

Something like this could do the trick, just a map of a filter of the first dataframe:

df2 %>%
  as_tibble() %>%
  mutate(links = map(var, ~sample(filter(df1, var == .)$ids), 1),
         index = row_number()) %>%
  unnest(links, keep_empty = TRUE) %>%
  group_by(index) %>%
  slice_sample(n = 1) %>%
  ungroup() %>%
  select(-index)

# # A tibble: 5 × 2
#   var   links
#   <chr> <int>
# 1 c         1
# 2 a         1
# 3 b         4
# 4 b         5
# 5 d        NA