retain only rows and columns that match with a string vector-CodePudding

I have a large DF with certain columns that have a vector of character values as below. The number of columns varies from dataset to dataset as well as the number of character vectors it holds also varies.

 ID                       Country1        Country2    Country3
1  1   Argentina, Japan,USA,Poland,   Argentina,USA    Pakistan
2  2 Colombia, Mexico,Uruguay,Dutch  Mexico,Uruguay Afganisthan
3  3        Argentina, Japan,USA,NA           Japan Khazagistan
4  4 Colombia, Mexico,Uruguay,Dutch Colombia, Dutch North Korea
5  5                   India, China           China        Iran

Would like to match them one-to-one with another string vector as below

vals_to_find <-c("Argentina","USA","Mexico")

If, a column/row matches to anyone of the strings passed would like to retain that column and row. Remove duplicates, and finally remove those values that do not match.

the desired output is as follows

  ID Countries.found
1  1  Argentina, USA
2  2          Mexico
3  3  Argentina, USA
4  4          Mexico

data

dput(df)
structure(list(ID = 1:5, Country1 = c("Argentina, Japan,USA,Poland,", 
"Colombia, Mexico,Uruguay,Dutch", "Argentina, Japan,USA,NA", 
"Colombia, Mexico,Uruguay,Dutch", "India, China"), Country2 = c("Argentina,USA", 
"Mexico,Uruguay", "Japan", "Colombia, Dutch", "China"), Country3 = c("Pakistan", 
"Afganisthan", "Khazagistan", "North Korea", "Iran")), class = "data.frame", row.names = c(NA, 
-5L))

dput(df_out) 

structure(list(ID = 1:4, Countries.found = c("Argentina, USA", 
"Mexico", "Argentina, USA", "Mexico")), class = "data.frame", row.names = c(NA, 
-4L))

Instead of a each column as a vector, if the file is read as one value per column. Then, was able do it as below

dput(df_out)
structure(list(ID = 1:5, X1 = c("Argentina", "Colombia", "Argentina", 
"Colombia", "India"), X2 = c("Japan", "Mexico", "Japan", "Mexico", 
"China"), X3 = c("USA", "Uruguay", "USA", "Uruguay", NA), X4 = c("Poland", 
"Dutch", NA, "Dutch", NA), X5 = c("Argentina", "Mexico", "Japan", 
"Colombia", "China"), X6 = c("USA", "Uruguay", NA, "Dutch", NA
), X7 = c("Pakistan", "Afganisthan", "Khazagistan", "North Korea", 
"Iran")), class = "data.frame", row.names = c(NA, -5L))


df_out %>%
  dplyr::select(
    where(~ !all(is.na(.x)))
  ) %>%
  dplyr::select(c(1, where(~ any(.x %in% vals_to_find)))) %>% 
  dplyr::mutate(dplyr::across(
    tidyselect::starts_with("X"),
    ~ vals_to_find[match(., vals_to_find)]
  )) %>%
  tidyr::unite("countries_found", tidyselect::starts_with("X"),
               sep = " | ", remove = TRUE, na.rm = TRUE
  )

Output

 ID                   countries_found
1  1 Argentina | USA | Argentina | USA
2  2                   Mexico | Mexico
3  3                   Argentina | USA
4  4                            Mexico

CodePudding user response：

unite the "Country" columns, then create a long vector by separating the values into rows, get all distinct values per ID, filter only those who are in vals_to_find, and summarise each countries.found toString.

library(tidyr)
library(dplyr)
df %>% 
  unite("Country", starts_with("Country"), sep = ",") %>% 
  separate_rows(Country) %>% 
  distinct(ID, Country) %>% 
  filter(Country %in% vals_to_find) %>% 
  group_by(ID) %>% 
  summarise(Countries.found = toString(Country))

output

# A tibble: 4 × 2
     ID Countries.found
  <int> <chr>          
1     1 Argentina, USA 
2     2 Mexico         
3     3 Argentina, USA 
4     4 Mexico

CodePudding user response：

We may use

library(dplyr)
library(tidyr)
library(stringr)
df %>%
   mutate(across(starts_with("Country"),
   ~ str_extract_all(.x, str_c(vals_to_find, collapse = "|")))) %>% 
  pivot_longer(cols = -ID, names_to = NULL, 
     values_to = 'Countries.found') %>%
  unnest(Countries.found) %>% 
  distinct %>% 
  group_by(ID) %>%
  summarise(Countries.found = toString(Countries.found))

-output

# A tibble: 4 × 2
     ID Countries.found
  <int> <chr>          
1     1 Argentina, USA 
2     2 Mexico         
3     3 Argentina, USA 
4     4 Mexico