I need to create functionality which is going to process the dictionary (dictionary.txt file). The goal is to find all six-letter words that are built of two concatenated smaller words e.g.:
con vex => convex
tail or => tailor
we aver => weaver
Of course, there may be some words inside the file that are not 6 letters long, but these can be easily sifted out using a simple method:
def cleanup_file
file_data = File.read('dictrionary.txt').split
file_data.reject! { |word| word.size < 6 }
end
But now comes the problem - how to find if the other strings in the array are made of two connected smaller words ?
[Edit]
Sample dictionary.txt file here
CodePudding user response:
Thinking just in a pseudo code solution, but you should:
- Iterate every line of the dictionary and store the words in 6 different arrays by the length of each word.
- Make sure that all words are downcased, there are no duplicates and all the values are sorted, so later you could properly use
.bsearch
in the arrays.
- Make sure that all words are downcased, there are no duplicates and all the values are sorted, so later you could properly use
- Iterate the length-6 array (for example
convex
) and look for a match of the first character of the current word in the length-1 array (c
for the given example) and in the length-5 array (onvex
). If there's a match, save the words.- Then keep looking in the length-2 and length-4 arrays for matches (
co
andnvex
correspondingly) and save a match. - Finally, look both parts of the string in the length-3 array (
con
andvex
) and save any match - Look for the next 6 characters string until you've finished.
- Then keep looking in the length-2 and length-4 arrays for matches (
Most likely there are better ways to solve this, like in the first iteration inserting each word in its corresponding array using .bsearch_index
to sort and not inserting duplicates in the same iteration, but most of the workload is going to be in the 2nd iteration and binary searches work in O(log n)
time, so I guess it should work quick enough.
CodePudding user response:
Suppose the dictionary is as follows.
dictionary = [
"a", "abased", "act", "action", "animal", "ape", "apeman",
"art", "assertion", "bar", "barbed", "barhop", "based", "be",
"become", "bed", "come", "hop", "ion", "man"
]
I assume that, like most dictionaries, dictionary
is sorted.
First compute the following hash.
words_by_len = dictionary.each_with_object({}) do |w,h|
len = w.length
(h[len] ||= []) << w if len < 7
end
#=> {1=>["a"],
# 6=>["abased", "action", "animal", "apeman", "barbed",
# "barhop", "become"],
# 3=>["act", "ape", "art", "bar", "bed", "hop", "ion", "man"],
# 5=>["based"],
# 2=>["be"],
# 4=>["come"]}
Each key is a word length (1-6) and each value is an array of words from dictionary
whose length is the value of the key.
Next I will define a helper function that returns true
or false
depending on whether a given array of words (list
) contains a given word (word
).
def found?(list, word)
w = list.bsearch { |w| w >= word }
w && w == word
end
For example:
found?(words_by_len[3], "art")
#=> true
found?(words_by_len[3], "any")
#=> false
See Array#bsearch.
We now extract the words of interest:
words_by_len[6].select do |w|
(1..5).any? do |i|
first, last = w[0,i], w[i..-1]
found?(words_by_len[i], first) && found?(words_by_len[6-i], last)
end
end
#=> ["abased", "action", "apeman", "barbed", "barhop", "become"]