Search elements of a single character string in a dataframe column to subset it

I have two dataframes:


set.seed(1)
df1 <- data.frame(k1 = "AFD(1);Acf(2);Vgr7(2);"
                  ,k2 = "ABC(7);BHG(46);TFG(675);")

df2 <- data.frame(site =c("AFD(1);AFD(2);", "Acf(2);", "TFG(677);", 
"XX(275);",  "ABC(7);", "ABC(9);")
                  ,p1 = rnorm(6, mean = 5, sd = 2)
                  ,p2 = rnorm(6, mean = 6.5, sd = 2))

The first dataframe is in fact a list of often very long strings made of "elements". Each "element" consists of a few letters/numbers, followed by a number in brackets, followed by a semicolon. In this example I only put 3 "elements" into each string, but in my real dataframe there are tens to hundreds of them.

> df1
                      k1                       k2
1 AFD(1);Acf(2);Vgr7(2); ABC(7);BHG(46);TFG(675);

The second dataframe shares some of the "elements" with df1. Its first column, called site, contains some (not all) of the "elements" from the first dataframe. Sometimes an "element" forms the whole string, and sometimes it is part of a longer string:

> df2
            site       p1       p2
1 AFD(1);AFD(2); 4.043700 3.745881
2        Acf(2); 5.835883 5.670011
3      TFG(677); 7.717359 5.711420
4       XX(275); 4.794425 6.381373
5        ABC(7); 5.775343 8.700051
6        ABC(9); 4.892390 8.026351

I would like to filter the whole of df2 using df2$site and each k column from df1 (there are many k columns, and not all of them contain k in their names).

The easiest way to explain this is to show what the desired output would look like.

> outcome
    k   site            p1          p2
1   k1  AFD(1);AFD(2);  4.043700    3.745881
2   k1  Acf(2);         5.835883    5.670011
3   k2  ABC(7);         5.775343    8.700051

The first column of the outcome dataframe corresponds to the column names in df1. The second column corresponds to the site column of df2 and contains only the sites from the df1 columns that were found in df2$site. The other columns are from df2.

I appreciate that this question is made of two separate "problems", one grepping-related and one related to looping through df1 columns. I decided to show the task in its entirety in case there exists a solution that addresses both in one go.

FAILED SOLUTION 1

I can create a string to grep, but for each column separately:

library(dplyr)
library(tibble)   # deframe()
library(stringr)  # str_replace_all()

# This replaces the semicolons with "|" but does not escape the brackets.
k1_pattern <- df1 %>% 
  select(k1) %>% 
  deframe() %>% 
  str_replace_all(";","|")

I am not sure how to use it from here. The attempt below did not work, perhaps because I did not escape the brackets, which I am struggling to do:

k1_result <- df2 %>% 
  filter(grepl(pattern = k1_pattern, site))

But even if it did work, it would only deal with a single column from df1, and I have many, and would like to perform this operation on all df1 columns at the same time.
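For reference, here is a minimal base R sketch of how the k1 pattern could be built so that grepl() matches whole "elements". In the pattern above, the unescaped brackets are read as regex groups rather than literal characters, and the trailing "|" leaves an empty alternative, so it probably matches far more than intended. The sketch assumes that brackets are the only regex metacharacters in the data, and the names elements, escaped and hits are made up for illustration.

```r
k1 <- "AFD(1);Acf(2);Vgr7(2);"
site <- c("AFD(1);AFD(2);", "Acf(2);", "TFG(677);",
          "XX(275);", "ABC(7);", "ABC(9);")

# Split into elements first, so no empty alternative ends up in the pattern.
elements <- strsplit(k1, ";", fixed = TRUE)[[1]]

# Escape the literal brackets: "AFD(1)" becomes "AFD\(1\)".
# Assumption: "(" and ")" are the only regex metacharacters in the data.
escaped <- gsub("([()])", "\\\\\\1", elements)

# Each element must start at the beginning of the string or right after a ";"
# and must be followed by ";", so only whole elements are matched.
k1_pattern <- paste0("(^|;)(", paste(escaped, collapse = "|"), ");")

hits <- grepl(k1_pattern, site)
site[hits]
```

With the example data this keeps "AFD(1);AFD(2);" and "Acf(2);". It still handles only one df1 column at a time.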

FAILED SOLUTION 2

I can create a list of sites to search in df2 from columns in df1:

k1_sites <- df1 %>% 
  select(k1) %>% 
  deframe() %>% 
  strsplit(., "[;]") %>% 
  unlist()

but the delimiter is lost here, and %in% cannot be used, as the match will sometimes be partial.
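One possible way around the partial-match problem is to split df2$site on the same delimiter as well, so that %in% compares whole "elements" on both sides. The following is a rough base R sketch of that idea for the k1 column only; site_elements and keep are illustrative names, not part of the code above.

```r
df1 <- data.frame(k1 = "AFD(1);Acf(2);Vgr7(2);",
                  k2 = "ABC(7);BHG(46);TFG(675);")
site <- c("AFD(1);AFD(2);", "Acf(2);", "TFG(677);",
          "XX(275);", "ABC(7);", "ABC(9);")

# Elements of k1, without the ";" delimiter.
k1_sites <- unlist(strsplit(df1$k1, ";", fixed = TRUE))

# Split every site string the same way: one character vector per row.
site_elements <- strsplit(site, ";", fixed = TRUE)

# Keep a row if at least one of its elements is an element of k1.
keep <- vapply(site_elements, function(el) any(el %in% k1_sites), logical(1))
site[keep]
```

Because both sides lose the delimiter in the same way, the comparison is exact: "AFD(1);AFD(2);" is kept because it contains "AFD(1)", while "ABC(9);" is not confused with "ABC(7)".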

CodePudding user response:

library(dplyr)

df2 %>% 
  mutate(site_list = strsplit(site, ";")) %>% 
  rowwise() %>% 
  filter(length(intersect(site_list,
                              unlist(strsplit(x = paste0(c(t(df1)), collapse=""), 
                                              split = ";")))) != 0) %>% 
  select(-site_list)

#> # A tibble: 3 x 3
#> # Rowwise: 
#>   site              p1    p2
#>   <chr>          <dbl> <dbl>
#> 1 AFD(1);AFD(2);  3.75  7.47
#> 2 Acf(2);         5.37  7.98
#> 3 ABC(7);         5.66  9.52

Updated answer:

library(dplyr)
library(tidyr)

df1 %>% 
  tibble::rownames_to_column("id") %>% 
  pivot_longer(-id, names_to = "k", values_to = "site") %>% 
  separate_rows(site, sep = ";") %>% 
  filter(site != "") %>% 
  select(-id) -> df1_k


df2 %>% 
  tibble::rownames_to_column("id") %>% 
  separate_rows(site, sep = ";") %>% 
  full_join(., df1_k, by = c("site")) %>% 
  group_by(id) %>% 
  fill(k, .direction = "downup") %>% 
  filter(!is.na(id) & !is.na(k)) %>% 
  summarise(k = first(k), 
            site = paste0(site, collapse = ";"),
            p1 = first(p1),
            p2 = first(p2), .groups = "drop") %>% 
  select(-id)

#> # A tibble: 3 x 4
#>   k     site              p1    p2
#>   <chr> <chr>          <dbl> <dbl>
#> 1 k1    AFD(1);AFD(2);  3.75  7.47
#> 2 k1    Acf(2);         5.37  7.98
#> 3 k2    ABC(7);         5.66  9.52

CodePudding user response:

Here's a way going to a long format for exact matching (so no regex):

library(dplyr)
library(tidyr)
df1_long = df1 |> stack() |>
  separate_rows(values, sep = ";") |>
  filter(values != "")

df2 |>
  mutate(id = row_number()) |>
  separate_rows(site, sep = ";") |>
  filter(site != "") |>
  left_join(df1_long, by = c("site" = "values")) |>
  group_by(id) |>
  filter(any(!is.na(ind))) |>
  summarize(
    site = paste(site, collapse = ";"),
    across(-site, \(x) first(na.omit(x)))
  )
# # A tibble: 3 × 5
#      id site             p1    p2 ind  
#   <int> <chr>         <dbl> <dbl> <fct>
# 1     1 AFD(1);AFD(2)  3.75  7.47 k1   
# 2     2 Acf(2)         5.37  7.98 k1   
# 3     5 ABC(7)         5.66  9.52 k2
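
For comparison, the same exact-matching idea can be sketched in base R without a join, looping over every column of df1 and keeping the original site strings (trailing ";" included). This is only an illustration under the question's example data; site_elements, matches and outcome are names chosen here.

```r
set.seed(1)
df1 <- data.frame(k1 = "AFD(1);Acf(2);Vgr7(2);",
                  k2 = "ABC(7);BHG(46);TFG(675);")
df2 <- data.frame(site = c("AFD(1);AFD(2);", "Acf(2);", "TFG(677);",
                           "XX(275);", "ABC(7);", "ABC(9);"),
                  p1 = rnorm(6, mean = 5, sd = 2),
                  p2 = rnorm(6, mean = 6.5, sd = 2))

# Split every site string into its elements once.
site_elements <- strsplit(df2$site, ";", fixed = TRUE)

# For each df1 column, keep the df2 rows sharing at least one element with it
# and label them with the column name.
matches <- lapply(names(df1), function(k) {
  k_elements <- unlist(strsplit(df1[[k]], ";", fixed = TRUE))
  keep <- vapply(site_elements, function(el) any(el %in% k_elements), logical(1))
  data.frame(k = rep(k, sum(keep)), df2[keep, , drop = FALSE])
})

outcome <- do.call(rbind, matches)
rownames(outcome) <- NULL
outcome
```

The result has the columns k, site, p1 and p2, with one row per matching (df1 column, df2 row) pair, so a df2 row whose elements occur in several df1 columns would appear once for each of them.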