Ruby/Rails dictionary app - 6 letter words finder that are built of two concatenated smaller words-CodePudding

I need to create functionality which is going to process the dictionary (dictionary.txt file). The goal is to find all six-letter words that are built of two concatenated smaller words e.g.:

con   vex => convex
tail   or => tailor
we   aver => weaver

Of course, there may be some words inside the file that are not 6 letters long, but these can be easily sifted out using a simple method:

def cleanup_file
  file_data = File.read('dictrionary.txt').split

  file_data.reject! { |word| word.size < 6 }
end

But now comes the problem - how to find if the other strings in the array are made of two connected smaller words ?

[Edit]

Sample dictionary.txt file here

CodePudding user response：

Thinking just in a pseudo code solution, but you should:

Iterate every line of the dictionary and store the words in 6 different arrays by the length of each word.
- Make sure that all words are downcased, there are no duplicates and all the values are sorted, so later you could properly use .bsearch in the arrays.
Iterate the length-6 array (for example convex) and look for a match of the first character of the current word in the length-1 array (c for the given example) and in the length-5 array (onvex). If there's a match, save the words.
- Then keep looking in the length-2 and length-4 arrays for matches (co and nvex correspondingly) and save a match.
- Finally, look both parts of the string in the length-3 array (con and vex) and save any match
- Look for the next 6 characters string until you've finished.

Most likely there are better ways to solve this, like in the first iteration inserting each word in its corresponding array using .bsearch_index to sort and not inserting duplicates in the same iteration, but most of the workload is going to be in the 2nd iteration and binary searches work in O(log n) time, so I guess it should work quick enough.

CodePudding user response：

Suppose the dictionary is as follows.

dictionary = [
  "a", "abased", "act", "action", "animal", "ape", "apeman",
  "art", "assertion", "bar", "barbed", "barhop", "based", "be",
  "become", "bed", "come", "hop", "ion", "man"
]

I assume that, like most dictionaries, dictionary is sorted.

First compute the following hash.

words_by_len = dictionary.each_with_object({}) do |w,h|
  len = w.length
  (h[len] ||= []) << w if len < 7
end    
  #=> {1=>["a"],
  #    6=>["abased", "action", "animal", "apeman", "barbed",
  #    "barhop", "become"],
  #    3=>["act", "ape", "art", "bar", "bed", "hop", "ion", "man"],
  #    5=>["based"],
  #    2=>["be"],
  #    4=>["come"]}

Each key is a word length (1-6) and each value is an array of words from dictionary whose length is the value of the key.

Next I will define a helper function that returns true or false depending on whether a given array of words (list) contains a given word (word).

def found?(list, word)
  w = list.bsearch { |w| w >= word }
  w && w == word
end

For example:

found?(words_by_len[3], "art")
  #=> true
found?(words_by_len[3], "any")
  #=> false

See Array#bsearch.

We now extract the words of interest:

words_by_len[6].select do |w|
  (1..5).any? do |i|
    first, last = w[0,i], w[i..-1]
    found?(words_by_len[i], first) && found?(words_by_len[6-i], last)
  end
end
  #=> ["abased", "action", "apeman", "barbed", "barhop", "become"]