Algorithm beginner asking for a scheme or ideas for fuzzy keyword matching in text - CodePudding

Background:
1, A manually maintained keyword library (fewer than 100,000 entries). Key features: it covers multiple languages (English included), and the keywords are generally fairly long (average length 42)
2, Given a piece of text, generally 500-1000 in length, which may itself be multilingual. Since we cannot tell in advance which languages are mixed in, every keyword is matched against the text

Requirements:
1, Find every occurrence of every keyword in the text, reporting the position and the keyword (for highlighting)
2, Keyword matching is case-insensitive
3, Keyword matching tolerates a few wrong characters (for example, a keyword of length 30 to 50 may match with up to two wrong characters)

For example:
Keyword: Caffeine, allowing one wrong character
It should match the following in the text: caffein, caffeina, caffeine, etc.
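
To make "wrong characters" concrete, here is a minimal sketch that assumes it means Levenshtein edit distance (each insertion, deletion, or substitution counts as one error). The post does not define the error model, so this is an assumption; it is consistent with the example, where caffein drops a character and caffeina replaces one. The function name edit_distance is made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, compared case-insensitively."""
    a, b = a.lower(), b.lower()
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# The three candidates from the example are all within one error of the keyword.
for candidate in ["caffein", "caffeina", "caffeine"]:
    print(candidate, edit_distance("Caffeine", candidate))
# caffein 1
# caffeina 1
# caffeine 0
```

Under this assumption, requirement 3 becomes: report a span of text when its edit distance to some keyword is at most the allowed number of errors.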

CodePudding user response:

I wrote a simple brute-force matcher myself: it walks through the text one character at a time and, at each position, only tries the keywords whose first letter matches. At the data volume described above, exact matching runs at an acceptable speed. Once a few wrong characters are allowed, however, processing time rises to 2-3 minutes, which is totally unacceptable. Hoping an expert can help.
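
For reference, the exact-match version of the brute-force approach described above might look roughly like the sketch below. The poster's actual code is not shown, so the function names and structure here are guesses. It also assumes that lowercasing does not change the length of the text, which holds for most scripts but not for every Unicode character.

```python
from collections import defaultdict


def build_first_letter_index(keywords):
    """Group keywords by their lowercased first character."""
    index = defaultdict(list)
    for kw in keywords:
        if kw:
            index[kw[0].lower()].append(kw)
    return index


def find_exact(text, keywords):
    """Return (position, keyword) for every case-insensitive exact occurrence."""
    index = build_first_letter_index(keywords)
    # Assumption: lower() keeps the length, so positions map back to the original text.
    lowered = text.lower()
    hits = []
    for pos, ch in enumerate(lowered):
        # Only keywords that start with the current character can match here.
        for kw in index.get(ch, ()):
            if lowered.startswith(kw.lower(), pos):
                hits.append((pos, kw))
    return hits


print(find_exact("Tea has caffeine; Caffeine is in coffee.", ["Caffeine", "Coffee"]))
# [(8, 'Caffeine'), (18, 'Caffeine'), (33, 'Coffee')]
```

The first-letter filter stops being safe once errors are allowed, because the wrong character may be the first one. Every position then has to be compared against many more keywords with a much more expensive comparison, which is one plausible reason for the slowdown reported here.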

CodePudding user response:

An embedded database would be a good fit for this.

CodePudding user response:

I don't see how an embedded database relates to this requirement. Please explain in detail.

CodePudding user response:

I suggest the original poster look into automata (state machines). Because you have the allowed-error requirement, the standard construction will need some adaptation.
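
The reply does not say which automaton it has in mind. One simple way to get the behaviour of an error-tolerant automaton without building it explicitly is the classic dynamic-programming scan for approximate substring matching (often attributed to Sellers), where one column of costs per text position plays the role of the automaton's state. The sketch below is that substitute technique, not the replier's method; fuzzy_find and its parameters are names chosen for illustration, and it assumes lowercasing keeps the text length unchanged.

```python
def fuzzy_find(text: str, keyword: str, max_errors: int):
    """Yield (start, end, errors) for each span text[start:end] that is
    within max_errors edits of keyword, compared case-insensitively."""
    pattern = keyword.lower()
    lowered = text.lower()
    m = len(pattern)
    # prev_cost[i]: fewest edits to match pattern[:i] against some span ending
    # at the previous text position; prev_start[i]: where that span starts.
    prev_cost = list(range(m + 1))
    prev_start = [0] * (m + 1)
    for end, ch in enumerate(lowered, start=1):
        cost = [0] * (m + 1)
        start = [end] * (m + 1)  # an empty match may begin at any position
        for i in range(1, m + 1):
            sub = prev_cost[i - 1] + (pattern[i - 1] != ch)  # match or substitute
            ins = prev_cost[i] + 1                           # extra character in text
            dele = cost[i - 1] + 1                           # keyword character missing
            best = min(sub, ins, dele)
            cost[i] = best
            if best == sub:
                start[i] = prev_start[i - 1]
            elif best == ins:
                start[i] = prev_start[i]
            else:
                start[i] = start[i - 1]
        if cost[m] <= max_errors:
            yield (start[m], end, cost[m])
        prev_cost, prev_start = cost, start


for hit in fuzzy_find("I like caffeina a lot", "Caffeine", 1):
    print(hit)
```

Overlapping hits are reported separately (here both "caffein" and "caffeina" qualify), so a highlighter would need to merge them. The scan costs roughly len(text) × len(keyword) cell updates per keyword, so it is probably better suited to verifying a short list of candidate keywords than to scanning the whole library.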

CodePudding user response:

Split each keyword into pieces and index them as (piece, keyword) pairs. For example, caffein can be split into caf, fei, ein, and those pieces go into the index.
At match time, look up the pieces in the same way. For Caffeine, search for caf, fei, ine respectively.
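
A rough sketch of this indexing idea follows. It uses a variation the reply does not spell out: each keyword is cut into max_errors + 1 non-overlapping pieces, so that if the keyword occurs with at most max_errors edits, at least one piece must still appear in the text unchanged (a pigeonhole argument). The function names are invented for illustration, and the guarantee assumes every keyword is longer than max_errors, which fits the average length of 42 mentioned in the question.

```python
from collections import defaultdict


def split_pieces(keyword: str, max_errors: int):
    """Cut keyword into max_errors + 1 contiguous, non-overlapping pieces."""
    kw = keyword.lower()
    n = max_errors + 1
    base, extra = divmod(len(kw), n)
    pieces, pos = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first pieces
        pieces.append(kw[pos:pos + size])
        pos += size
    return pieces


def build_piece_index(keywords, max_errors: int):
    """Map each piece to the set of keywords that contain it."""
    index = defaultdict(set)
    for kw in keywords:
        for piece in split_pieces(kw, max_errors):
            if piece:  # skip empty pieces from very short keywords
                index[piece].add(kw)
    return index


def candidate_keywords(text: str, index):
    """Return keywords with at least one piece occurring exactly in text."""
    lowered = text.lower()
    lengths = {len(piece) for piece in index}
    found = set()
    for pos in range(len(lowered)):
        for size in lengths:
            found |= index.get(lowered[pos:pos + size], set())
    return found


index = build_piece_index(["Caffeine", "Theobromine"], max_errors=1)
print(split_pieces("Caffeine", 1))                         # ['caff', 'eine']
print(candidate_keywords("I like caffeina a lot", index))  # {'Caffeine'}
```

The index only narrows the library down to candidates. A piece can occur in the text without the whole keyword being close, so each surviving keyword still has to be checked with a real edit-distance comparison around the position where its piece was found.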