Given:
sentence = "The world is a great place to live and eat"
dictionary = {
"ea": "These words rym with each" ,
"lace": "something"
}
Expected:
matched_words --> ["great", "place", "eat"]
My approach: First lookup the dictionary enries and then check for maches in the string
Object.entries(dictionary).map(([key]) => {
let regex = new RegExp(`[a-zA-Z]?${key}[a-zA-Z]?`, "g");
let keywords = text.match(regex);
return keywords
}
CodePudding user response:
In your example, it seems like its just the dictionary keys that matter. If that's so, then it's just the well-studied problem called String Searching. If performance really is that important, then you might have to study one of these algorithms and find a library that does it or implement it yourself.
For example, you could use KMP: https://en.wikipedia.org/wiki/Knuth–Morris–Pratt_algorithm
The performance for this is O([length of string] [length of pattern]) for each pattern, or in total, O([length of string] * [# patterns] [total length of patterns]). There is probably a faster way
CodePudding user response:
First of all, your code does not produce the results you expect. If you correct the variable name and add the missing parenthesis, it outputs:
[["reat", "eat"], ["place"]]
To get the desired result, you would need to change the regex so it matches more than one surrounding letter, and returns a one-dimensional array, using flatMap
:
const text = "The world is a great place to live and eat";
const dictionary = {
"ea": "These words rhyme with each" ,
"lace": "something"
};
const results = Object.entries(dictionary).flatMap(([key]) => {
let regex = new RegExp(`[a-zA-Z]*${key}[a-zA-Z]*`, "g");
let keywords = text.match(regex);
return keywords
});
console.log(results);
This algorithm still has a flaw: it can return the same word multiple times. This happens when multiple dictionary keys occur in the same word. For instance, if the text has the word "seaplace", it will come out twice.
If this is about rhyming ("rym"?), then you probably don't want to allow vowels following the pattern, and neither vowels immediately in front of the pattern. Still, the English language is much more complex than that, and two words with "ea" in their final syllable are not guaranteed to rhyme (The words "great", "eat", "near", "bear" and "linear" do not rhyme with each other). But I will leave that for you to define, as your question does not seem to be about the rhyming logic.
You can avoid the explicit loop and rely on one regular expression only, thereby moving the logic to compiled code in the JavaScript engine:
const text = "The world is a great place to live and eat";
const dictionary = {
"ea": "These words rhyme with each" ,
"lace": "something"
};
const regex = RegExp(`\\b\\w*(?:${Object.keys(dictionary).join("|")})\\w*`, "gi");
const results = text.match(regex);
console.log(results);
Note that this algorithm also ensures that even if a word in the text can match with multiple dictionary keys, it only comes out once.
CodePudding user response:
I'm not quite sure how in detail behind the scenes the effort in terms of an exponential iteration growth or just a multiple times iteration approach really is.
But since one has to use a dynamically to be built regex/pattern anyhow I would sort
the dictionary
's keys
array by length
and then apply its join
ed string as alternation to a single to be built regex which then gets match
ed exactly once against the OP's text
.
As for the OP's provided example the created regex then would be ... /\b\w*(lace|ea)\w*/g
.
keys
iterates oncesort
does not add an exponential growth, though the effort is slightly greater than a single iterationjoin
can be counted as a full iteration cycle as well.
Note
The keys
array has to be sorted descending by the length
of its containing keys in order to always assure the correct matching of the regex-alternations.
Summary
Everything boils down to how effective a regex which features alternations does perform.
Thus the OP might need to come up additionally with a performance test in order to figure out whether there are real bottlenecks amongst the different implementations of possible approaches.
const dictionary = {
ea: "These words rhyme with each" ,
lace: "something"
};
const text = "The world is a great place to live and eat";
const dictAlternatives = Object
.keys(dictionary)
.sort((a, b) => b.length - a.length)
.join('|');
console.log({
dictAlternatives,
regex: RegExp(`\\b\\w*(${ dictAlternatives })\\w*`, 'g'),
matchingResults: text.match(
RegExp(`\\b\\w*(${ dictAlternatives })\\w*\\b`, 'g')
)
});