I'm fairly new to R and I'm trying to write code to solve the Spelling Bee game on the NYTimes website to see how I'm doing. I tried writing a function to compare two strings ('given' and 'test_word') that returns TRUE if you can spell 'test_word' with only the letters from 'given' and FALSE otherwise. I got that to work, so I downloaded the enable1 wordlist and tried to apply that function to every word in the list. Instead of giving me a new column in the dataframe with the result of the function on each word, it just returns FALSE for every row, and I'm just confused as to what I'm doing wrong. It looks like it's just taking the value of the function for the first entry in the wordlist instead of looking at each word individually.
Here's my code:
library(dplyr)
is_good <- function(given, test_word) {
diffs <- paste(unlist(setdiff(strsplit(test_word,'')[[1]],strsplit(given,'')[[1]])),collapse='')
match = case_when(
diffs == '' ~ TRUE,
diffs != '' ~ FALSE
)
return(match)
}
given <- 'CLEXION'
#words = read.csv('c:/Users/Dave/Documents/R/enable1.txt', header=FALSE)
# edited to add sample list of words
V1 <- c('AAHED','LEXICON','LION','COLLECTION')
words <- data.frame(V1)
names(words) <- c('word')
words <- filter(words, nchar(word)>=4)
words$word <- toupper(words$word)
words <- words %>% mutate(is_match = is_good(given,word))
After running all this, I get this output:
> filter(words, is_match == TRUE)
[1] word is_match
<0 rows> (or 0-length row.names)
Just to check I ran a filter on a word I know should work and got
> filter(words, word=='LEXICON')
word is_match
1 LEXICON FALSE
If I run the function on its own with one word I get the expected result:
> is_good(given,'LEXICON')
[1] TRUE
Why is the function call in my mutate step not applying the function to each row? I'm getting comfortable with the idea of lists and data frames but there's obviously something I'm missing when putting it into practice.
UPDATE: I researched the lapply function and it did what I hoped - my new code looks like
test_split <- lapply(test_word, function(w) {strsplit(w,'')[[1]]})
given_split <- strsplit(given,'')[[1]]
diff_1 <- lapply(test_split, function(x) {paste(unlist(setdiff(x, given_split)),collapse='')})
match = lapply(diff_1, function(x) {
case_when(
x == '' ~ TRUE,
x != '' ~ FALSE
)})
CodePudding user response:
Answer to match the OP's view
is_good <- function(given, test_word) {
test_split <- strsplit(test_word, split = "") # don't need lapply here since strsplit is already vectorized
given_split <- strsplit(given,'')[[1]]
diff_1 <- lapply(test_split, function(x) {paste(unlist(setdiff(x, given_split)),collapse='')})
# From here, it is back to simple things!
diff_1 <- unlist(diff_1)
match <- diff_1 == ""
return(match)
}
Thanks for providing sample data. It makes it easier to solve.
It is probably overkill, but here is the dplyr / tidyverse answer.
Note that |>
is the base pipe (similar to %>%
).(will work for R>=4.1.0)
Note you will need extra packages ( stringr and tidyr). Check if you have them installed.
If not already installed, run
install.packages(c("tidyr", "stringr")
purrr::map()
is used to manipulate elements of a list
purrr::map_lgl()
ensures you return a logical vector
Solution for dealing with duplicated letters
library(dplyr)
is_good <- function(given, test_word) {
# Standardizing to upper case
given <- toupper(given)
test_word <- toupper(test_word)
# Extracting letters
given_letters <- stringr::str_split(given, pattern = "")
given_letters <- unlist(given_letters)
# This part deals with duplicated letters
# there is probably a base R way to do it.
given_letters <- tibble(given_letter = given_letters) |>
group_by(given_letter) |>
mutate(letter = paste0(given_letter, row_number())) |>
pull(letter)
# For word "DREAD", it will return ("D1", "R1", "E1", "D2")
# Manipulating test_word
letters_in_test_words <- stringr::str_split(test_word, pattern = "")
# a little bit more complicated, but similar to previously to mark duplicated
# letters. It outputs a list. Example: for input "THIN", "MINI"
# a list of 2
# [[1]] : "T1", "H1", "I1", "N1"
# [[2]] : "M1", "I1", "N1" "I2"
letters_in_test_words <- tibble(
word_id = 1:length(letters_in_test_words),
letter = letters_in_test_words
) |>
tidyr::unnest(letter) |>
group_by(word_id, letter) |>
mutate(letter = paste0(letter, dplyr::row_number())) |>
ungroup() |>
tidyr::nest(data = letter) |>
mutate(data = purrr::map(data, 1)) |>
pull(data)
# iterates over the words to find if there is a complete match
match <- purrr::map_lgl(letters_in_test_words, ~ all(.x %in% given_letters))
match
}
given <- 'CLEXION'
#words = read.csv('c:/Users/Dave/Documents/R/enable1.txt', header=FALSE)
# edited to add sample list of words
V1 <- c('AAHED','LEXICON','LION','COLLECTION')
words <- data.frame(word = V1)
words <- filter(words, nchar(word)>=4)
words$word <- toupper(words$word) # a good idea to be put inside the function
is_good("AHMED","LEXICO1N")
#> [1] FALSE
words <- words %>% mutate(is_match = is_good(given,word))
words |>
filter(is_match)
#> word is_match
#> 1 LEXICON TRUE
#> 2 LION TRUE
# My solution checks for duplicated letters
# You probably don't want this as TRUE.
is_good(given = "TRUCE", test_word = "TRUCEE")
#> [1] FALSE
Created on 2022-06-17 by the reprex package (v2.0.1)
Note: My function could probably exist in base R as well, but I am better with tables. It is also overkill since it checks for duplicates.
A solution that doesn't check for duplicates (much simpler)
library(dplyr)
is_good <- function(given, test_word) {
# Standardizing
given <- toupper(given)
test_word <- toupper(test_word)
# Extracting letters
given_letters <- stringr::str_split(given, pattern = "")
given_letters <- unlist(given_letters)
# Manipulating test_word
letters_in_test_words <- stringr::str_split(test_word, pattern = "")
# a little bit more complicated, but simi
# iterates over the words to find if there is a complete match
match <- purrr::map_lgl(letters_in_test_words, ~ all(.x %in% given_letters))
match
}
given <- 'CLEXION'
#words = read.csv('c:/Users/Dave/Documents/R/enable1.txt', header=FALSE)
# edited to add sample list of words
V1 <- c('AAHED','LEXICON','LION','COLLECTION')
words <- data.frame(word = V1)
words <- filter(words, nchar(word)>=4)
words$word <- toupper(words$word) # a good idea to be put inside the function
is_good("AHMED","LEXICO1N")
#> [1] FALSE
words <- words %>% mutate(is_match = is_good(given,word))
words |>
filter(is_match)
#> word is_match
#> 1 LEXICON TRUE
#> 2 LION TRUE
# You probably don't want this as TRUE. but it will come out as TRUE without the
# Duplicated letters are ignored.
is_good(given = "TRUCE", test_word = "TRUCEE")
#> [1] TRUE
Created on 2022-06-17 by the reprex package (v2.0.1)