Suppose I have a data frame df with a column of strings containing job vacancy requirements. I also have a vector of strings, prog_langs, containing programming language names. I am looking for an elegant dplyr way to create, within mutate, one new column for each programming language in prog_langs, with column names of the form .names = "ProgLang_{prog_langs}", that tests whether each string of df contains that particular programming language (TRUE if it does, FALSE otherwise).
# custom FUN (grepl() is vectorised over txt and already returns TRUE/FALSE)
is_contains = function(txt, cond) grepl(cond, txt)
# Vector of programming languages
prog_langs = c("python", "java", "sql", "html")
# Vector of strings contains job vacancies requirements
df = data.frame("string" = c(
  "exposure to scripting or programming languages (e.g python, c , or powershell).",
  "scripting skills (e.g. java, javascript, beanshell, etc.)",
  "basic understanding of sql",
  "html and css knowledge is a must."
))
# example of code (my attempt)
df %>%
  mutate(across(.cols = vars(prog_langs),
                .fns = function(x) is_contains(txt = string, cond = x),
                .names = 'ProgLang_{.col}'))
Desired output:
A new df with N additional columns (where N is the length of prog_langs, i.e. the number of programming languages), each containing TRUE or FALSE.
CodePudding user response:
Using purrr::map, purrr::transpose and tidyr::unnest_wider you could do:
library(dplyr, warn.conflicts = FALSE)
library(purrr)
library(tidyr)
prog_langs <- c("python", "java", "sql", "html")
names(prog_langs) <- prog_langs
df %>%
  mutate(ProgLang = transpose(map(prog_langs, ~ grepl(.x, string)))) %>%
  unnest_wider(ProgLang)
#> # A tibble: 4 × 5
#>   string                                                python java  sql   html
#>   <chr>                                                 <lgl>  <lgl> <lgl> <lgl>
#> 1 exposure to scripting or programming languages (e.g … TRUE   FALSE FALSE FALSE
#> 2 scripting skills (e.g. java, javascript, beanshell, … FALSE  TRUE  FALSE FALSE
#> 3 basic understanding of sql                            FALSE  FALSE TRUE  FALSE
#> 4 html and css knowledge is a must.                     FALSE  FALSE FALSE TRUE
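The output above names the new columns python, java, sql and html rather than ProgLang_python and so on, as the question asked. A minimal sketch of one way to get the prefix, assuming the same df and prog_langs as in the question: name the patterns with the target column names before mapping, then bind the resulting list as columns. The flags and result names are only illustrative, and fixed = TRUE (literal matching instead of regex) is an assumption you may want to drop.

```r
library(dplyr)
library(purrr)

prog_langs <- c("python", "java", "sql", "html")
df <- data.frame(string = c(
  "exposure to scripting or programming languages (e.g python, c , or powershell).",
  "scripting skills (e.g. java, javascript, beanshell, etc.)",
  "basic understanding of sql",
  "html and css knowledge is a must."
))

# Name each pattern with its target column name; map() keeps those names,
# so the result is a named list with one logical vector per language.
flags <- prog_langs %>%
  set_names(paste0("ProgLang_", prog_langs)) %>%
  map(~ grepl(.x, df$string, fixed = TRUE))  # fixed = TRUE: literal match (assumption)

# Attach the logical vectors to the original data as new columns.
result <- bind_cols(df, as_tibble(flags))
result
```

Since grepl is vectorised over the strings, each call returns one full column, so no transpose or unnest step is needed here.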
CodePudding user response:
This solution uses tidyr::crossing to obtain the Cartesian product of string and prog_langs, then looks for matches using grepl, and finally widens the data.frame using tidyr::pivot_wider:
library(purrr)
library(tidyr)
library(dplyr)
df |>
  crossing(ProgLang = prog_langs) |>
  mutate(contains = map2_lgl(ProgLang, string, ~grepl(.x, .y))) |>
  pivot_wider(names_from = ProgLang,
              values_from = contains,
              names_prefix = "ProgLang_")
##> # A tibble: 4 × 5
##>   string                ProgLang_html ProgLang_java ProgLang_python ProgLang_sql
##>   <chr>                 <lgl>         <lgl>         <lgl>           <lgl>
##> 1 basic understanding … FALSE         FALSE         FALSE           TRUE
##> 2 exposure to scriptin… FALSE         FALSE         TRUE            FALSE
##> 3 html and css knowled… TRUE          FALSE         FALSE           FALSE
##> 4 scripting skills (e.… FALSE         TRUE          FALSE           FALSE
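Note that in the output above both the rows and the new columns come back in alphabetical order, because tidyr::crossing de-duplicates and sorts its inputs. If the original order matters, a sketch along the same lines with tidyr::expand_grid swapped in for crossing (my substitution, not part of the solution above) could look like this, assuming the same df and prog_langs as in the question:

```r
library(dplyr)
library(purrr)
library(tidyr)

prog_langs <- c("python", "java", "sql", "html")
df <- data.frame(string = c(
  "exposure to scripting or programming languages (e.g python, c , or powershell).",
  "scripting skills (e.g. java, javascript, beanshell, etc.)",
  "basic understanding of sql",
  "html and css knowledge is a must."
))

# expand_grid() builds every string/language pair without sorting,
# so rows stay in the order of df and languages in the order of prog_langs.
result <- expand_grid(df, ProgLang = prog_langs) |>
  mutate(contains = map2_lgl(ProgLang, string, ~ grepl(.x, .y))) |>
  pivot_wider(names_from = ProgLang,
              values_from = contains,
              names_prefix = "ProgLang_")
result
```

Unlike crossing, expand_grid does not drop duplicates, so if df contained repeated strings pivot_wider would no longer see one value per cell; in that case adding a row id column first would be needed.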
Edit, as requested in the comment, to handle a second group of patterns (certificates) alongside the programming languages:
# Vector of programming languages
prog_langs <- c("python", "java", "sql", "html")
certificate <- c("oscp", "cissp", "ceh")
data.frame(Type = "ProgLang", pattern = prog_langs) |>
  rbind(data.frame(Type = "Certificate", pattern = certificate)) |>
  crossing(df) |>
  mutate(Contains = map2_lgl(pattern, string, grepl)) |>
  mutate(Thing = paste(Type, pattern, sep = "_"),
         .keep = "unused") |>
  pivot_wider(names_from = Thing, values_from = Contains)
##> # A tibble: 4 × 8
##>   string         Certificate_ceh Certificate_cis… Certificate_oscp ProgLang_html
##>   <chr>          <lgl>           <lgl>            <lgl>            <lgl>
##> 1 basic underst… FALSE           FALSE            FALSE            FALSE
##> 2 exposure to s… FALSE           FALSE            FALSE            FALSE
##> 3 html and css … FALSE           FALSE            FALSE            TRUE
##> 4 scripting ski… FALSE           FALSE            FALSE            FALSE
##> # … with 3 more variables: ProgLang_java <lgl>, ProgLang_python <lgl>,
##> #   ProgLang_sql <lgl>