Home > Mobile >  R - using function to create new column based on string comparison
R - using function to create new column based on string comparison

Time:06-18

I'm fairly new to R and I'm trying to write code to solve the Spelling Bee game on the NYTimes website to see how I'm doing. I tried writing a function to compare two strings ('given' and 'test_word') that returns TRUE if you can spell 'test_word' with only the letters from 'given' and FALSE otherwise. I got that to work, so I downloaded the enable1 wordlist and tried to apply that function to every word in the list. Instead of giving me a new column in the dataframe with the result of the function on each word, it just returns FALSE for every row, and I'm just confused as to what I'm doing wrong. It looks like it's just taking the value of the function for the first entry in the wordlist instead of looking at each word individually.

Here's my code:

library(dplyr)

is_good <- function(given, test_word) {
  
  diffs <- paste(unlist(setdiff(strsplit(test_word,'')[[1]],strsplit(given,'')[[1]])),collapse='')

  match = case_when(
        diffs == '' ~ TRUE,
        diffs != '' ~ FALSE
  )
  
  return(match)
  }

given <- 'CLEXION'

#words = read.csv('c:/Users/Dave/Documents/R/enable1.txt', header=FALSE)
# edited to add sample list of words
V1 <- c('AAHED','LEXICON','LION','COLLECTION')
words <- data.frame(V1)

names(words) <- c('word')
words <- filter(words, nchar(word)>=4)
words$word <- toupper(words$word)

words <- words %>% mutate(is_match = is_good(given,word))

After running all this, I get this output:

> filter(words, is_match == TRUE)
[1] word     is_match
<0 rows> (or 0-length row.names)

Just to check I ran a filter on a word I know should work and got

> filter(words, word=='LEXICON')
     word is_match
1 LEXICON    FALSE

If I run the function on its own with one word I get the expected result:

> is_good(given,'LEXICON')
[1] TRUE

Why is the function call in my mutate step not applying the function to each row? I'm getting comfortable with the idea of lists and data frames but there's obviously something I'm missing when putting it into practice.

UPDATE: I researched the lapply function and it did what I hoped - my new code looks like

  test_split <- lapply(test_word, function(w) {strsplit(w,'')[[1]]})
  given_split <- strsplit(given,'')[[1]]
  diff_1 <- lapply(test_split, function(x) {paste(unlist(setdiff(x, given_split)),collapse='')})
  
  match = lapply(diff_1, function(x) {
    case_when(
        x == '' ~ TRUE,
        x != '' ~ FALSE
  )})

CodePudding user response:

Answer to match the OP's view

is_good <- function(given, test_word) {
  
  test_split <- strsplit(test_word, split = "") # don't need lapply here since strsplit is already vectorized
  given_split <- strsplit(given,'')[[1]]
  diff_1 <- lapply(test_split, function(x) {paste(unlist(setdiff(x, given_split)),collapse='')})
  # From here, it is back to simple things!
  diff_1 <- unlist(diff_1)
  
  match <- diff_1 == ""
  return(match)
}


Thanks for providing sample data. It makes it easier to solve.

It is probably overkill, but here is the dplyr / tidyverse answer.

Note that |> is the base pipe (similar to %>%).(will work for R>=4.1.0)

Note you will need extra packages ( stringr and tidyr). Check if you have them installed.

If not already installed, run install.packages(c("tidyr", "stringr")

purrr::map() is used to manipulate elements of a list purrr::map_lgl() ensures you return a logical vector

Solution for dealing with duplicated letters

library(dplyr)
is_good <- function(given, test_word) {
  # Standardizing to upper case
  given <- toupper(given)
  test_word <- toupper(test_word)
  
  # Extracting letters
  given_letters <- stringr::str_split(given, pattern = "")
  given_letters <- unlist(given_letters)
  
  
  # This part deals with duplicated letters
  # there is probably a base R way to do it.
  given_letters <-  tibble(given_letter = given_letters) |> 
    group_by(given_letter) |> 
    mutate(letter = paste0(given_letter, row_number())) |> 
    pull(letter)
  # For word "DREAD", it will return ("D1", "R1", "E1", "D2")
  


  # Manipulating test_word
  letters_in_test_words <- stringr::str_split(test_word, pattern = "")

 # a little bit more complicated, but similar to previously to mark duplicated
 # letters. It outputs a list. Example: for input "THIN", "MINI"
 # a list of 2 
 # [[1]] : "T1", "H1", "I1", "N1" 
 # [[2]] : "M1", "I1", "N1" "I2"
  letters_in_test_words <- tibble(
    word_id = 1:length(letters_in_test_words),
    letter = letters_in_test_words
    ) |> 
    tidyr::unnest(letter) |> 
    group_by(word_id, letter) |> 
    mutate(letter = paste0(letter, dplyr::row_number())) |> 
    ungroup() |> 
    tidyr::nest(data = letter) |> 
    mutate(data = purrr::map(data, 1)) |> 
    pull(data)
  
  # iterates over the words to find if there is a complete match
  match <- purrr::map_lgl(letters_in_test_words, ~ all(.x %in% given_letters))
  match
}


given <- 'CLEXION'
#words = read.csv('c:/Users/Dave/Documents/R/enable1.txt', header=FALSE)
# edited to add sample list of words
V1 <- c('AAHED','LEXICON','LION','COLLECTION')
words <- data.frame(word = V1)

words <- filter(words, nchar(word)>=4)
words$word <- toupper(words$word) # a good idea to be put inside the function

is_good("AHMED","LEXICO1N")
#> [1] FALSE

words <- words %>% mutate(is_match = is_good(given,word))
words |> 
  filter(is_match)
#>      word is_match
#> 1 LEXICON     TRUE
#> 2    LION     TRUE

# My solution checks for duplicated letters
# You probably don't want this as TRUE.
is_good(given = "TRUCE", test_word =  "TRUCEE")
#> [1] FALSE

Created on 2022-06-17 by the reprex package (v2.0.1)

Note: My function could probably exist in base R as well, but I am better with tables. It is also overkill since it checks for duplicates.


A solution that doesn't check for duplicates (much simpler)

library(dplyr)
is_good <- function(given, test_word) {
  # Standardizing
  given <- toupper(given)
  test_word <- toupper(test_word)
  
  # Extracting letters
  given_letters <- stringr::str_split(given, pattern = "")
  given_letters <- unlist(given_letters)
  
  
  # Manipulating test_word
  letters_in_test_words <- stringr::str_split(test_word, pattern = "")
  # a little bit more complicated, but simi
  # iterates over the words to find if there is a complete match
  match <- purrr::map_lgl(letters_in_test_words, ~ all(.x %in% given_letters))
  match
}


given <- 'CLEXION'
#words = read.csv('c:/Users/Dave/Documents/R/enable1.txt', header=FALSE)
# edited to add sample list of words
V1 <- c('AAHED','LEXICON','LION','COLLECTION')
words <- data.frame(word = V1)

words <- filter(words, nchar(word)>=4)
words$word <- toupper(words$word) # a good idea to be put inside the function

is_good("AHMED","LEXICO1N")
#> [1] FALSE

words <- words %>% mutate(is_match = is_good(given,word))
words |> 
  filter(is_match)
#>      word is_match
#> 1 LEXICON     TRUE
#> 2    LION     TRUE

# You probably don't want this as TRUE. but it will come out as TRUE without the
# Duplicated letters are ignored.
is_good(given = "TRUCE", test_word =  "TRUCEE")
#> [1] TRUE

Created on 2022-06-17 by the reprex package (v2.0.1)

  • Related