Is there a faster method to compare string words against a dictionary with anything less than O(n^2)-CodePudding

Given:

sentence = "The world is a great place to live and eat"
dictionary = {
              "ea": "These words rym with each" ,
              "lace": "something"
             }

Expected:

matched_words --> ["great", "place", "eat"]

My approach: First lookup the dictionary enries and then check for maches in the string

Object.entries(dictionary).map(([key]) => {
        let regex = new RegExp(`[a-zA-Z]?${key}[a-zA-Z]?`, "g");
        let keywords = text.match(regex);
        return keywords
}

CodePudding user response：

In your example, it seems like its just the dictionary keys that matter. If that's so, then it's just the well-studied problem called String Searching. If performance really is that important, then you might have to study one of these algorithms and find a library that does it or implement it yourself.

For example, you could use KMP: https://en.wikipedia.org/wiki/Knuth–Morris–Pratt_algorithm

The performance for this is O([length of string] [length of pattern]) for each pattern, or in total, O([length of string] * [# patterns] [total length of patterns]). There is probably a faster way

CodePudding user response：

First of all, your code does not produce the results you expect. If you correct the variable name and add the missing parenthesis, it outputs:

[["reat", "eat"], ["place"]]

To get the desired result, you would need to change the regex so it matches more than one surrounding letter, and returns a one-dimensional array, using flatMap:

const text = "The world is a great place to live and eat";
const dictionary = {
    "ea": "These words rhyme with each" ,
    "lace": "something"
};

const results = Object.entries(dictionary).flatMap(([key]) => {
    let regex = new RegExp(`[a-zA-Z]*${key}[a-zA-Z]*`, "g");
    let keywords = text.match(regex);
    return keywords
});
console.log(results);

This algorithm still has a flaw: it can return the same word multiple times. This happens when multiple dictionary keys occur in the same word. For instance, if the text has the word "seaplace", it will come out twice.

If this is about rhyming ("rym"?), then you probably don't want to allow vowels following the pattern, and neither vowels immediately in front of the pattern. Still, the English language is much more complex than that, and two words with "ea" in their final syllable are not guaranteed to rhyme (The words "great", "eat", "near", "bear" and "linear" do not rhyme with each other). But I will leave that for you to define, as your question does not seem to be about the rhyming logic.

You can avoid the explicit loop and rely on one regular expression only, thereby moving the logic to compiled code in the JavaScript engine:

const text = "The world is a great place to live and eat";
const dictionary = {
    "ea": "These words rhyme with each" ,
    "lace": "something"
};

const regex = RegExp(`\\b\\w*(?:${Object.keys(dictionary).join("|")})\\w*`, "gi");

const results = text.match(regex);
console.log(results);

Note that this algorithm also ensures that even if a word in the text can match with multiple dictionary keys, it only comes out once.

CodePudding user response：

I'm not quite sure how in detail behind the scenes the effort in terms of an exponential iteration growth or just a multiple times iteration approach really is.

But since one has to use a dynamically to be built regex/pattern anyhow I would sort the dictionary's keys array by length and then apply its joined string as alternation to a single to be built regex which then gets matched exactly once against the OP's text.

As for the OP's provided example the created regex then would be ... /\b\w*(lace|ea)\w*/g.

keys iterates once
sort does not add an exponential growth, though the effort is slightly greater than a single iteration
join can be counted as a full iteration cycle as well.

Note

The keys array has to be sorted descending by the length of its containing keys in order to always assure the correct matching of the regex-alternations.

Summary

Everything boils down to how effective a regex which features alternations does perform.

Thus the OP might need to come up additionally with a performance test in order to figure out whether there are real bottlenecks amongst the different implementations of possible approaches.

const dictionary = {
  ea: "These words rhyme with each" ,
  lace: "something"
};
const text = "The world is a great place to live and eat";

const dictAlternatives = Object
  .keys(dictionary)
  .sort((a, b) => b.length - a.length)
  .join('|');

console.log({
  dictAlternatives,
  regex: RegExp(`\\b\\w*(${ dictAlternatives })\\w*`, 'g'),
  matchingResults: text.match(
    RegExp(`\\b\\w*(${ dictAlternatives })\\w*\\b`, 'g')
  )
});