Home > other >  Questions about fulltext-index performance
Questions about fulltext-index performance

Time:11-17

I searched for a search engine, but the price is expensive or it seems to take a lot of time to build, so I want to save it through 'singer name music name' and use it as a fullText-index.

However, even if I keep searching, it is constantly compared with like, and only the advantages of the function are introduced, and the disadvantages are hard to see even if I look for them.

Currently, there is no big problem except that the index grows when about 30,000 data is created, so I am going to use it, but I wonder what the downsides of fulltext-index are!

If you search additionally, they say that it is better to install a morpheme analyzer, but I plan to use an ngram parser with ngram=1.

Addition) If too large an index can be a problem, I wonder how to increase the ngram size and search like like for 2 letters or less.

CodePudding user response:

Downsides of FULLTEXT. Note that many can be worked around, but it takes deliberate extra code.

  • The definition of a "letter". That is, certain punctuation marks are treated as word breaks. You might be able to live with it, or you might sanitize the input (see below).
  • Min word length -- The default is 3. You can lower that to 2 or even 1, but there is a price to pay. Avoiding short words in the "sanitization" is one solution.
  • Short words and stop words cannot be forced to be present. (I think this is a bug.) AGAINST (" john q smith" IN BOOLEAN MODE) will always fail. This, on the other hand, will succeed, though it may deliver too many rows: AGAINST (" john q smith" IN BOOLEAN MODE). That is, "John W Smith" will succeed. See "fine tuning", below.
  • When the order of words is important (you want "foo bar", but not "bar foo" nor "foo blah blah bar"), there are some quoting you can do to force that in FT. (Watch out for short words.)
  • "stop words" -- There is a list of words not to index (eg, "the"). You can point to a different list. (I don't have the details off the top of my head.)
  • All or none -- Many of the tunable things I mention here apply to all FT indexes in the system. You may need to find a workaround to get the effect of different settings in different tables.

Sanitizing the input. Part numbers and model numbers are a nuisance to deal with because of the liberties that manufacturers take. Example: "Model 3" -- "Model" is a nice fulltext 'word', but '3' is not. You could [perhaps] tweak the input (when INSERTing) to say "Model3". Another example "xyz-123". Another: "John Q. Smith"

Combining columns. It may be beneficial to put together likely FT strings (from many columns) and throw them into a single column that is purely for searching. Do the sanitizing as you insert the data.

Fine-tuning. Here's one way to solve some of the hiccups:

WHERE MATCH(texts) AGAINST (" john q  smith" IN BOOLEAN MODE)
  AND texts LIKE "%John Q. Smith%"

Explanation... The FT test will come first and be fast. Then comes the LIKE (or REGEXP) which is much slower, but is not a problem because it is being applied only to those rows that passed the FT test. Example: "John Jones and Phil Smith" will pass the MATCH, but then be filtered out by the LIKE.

Some more notes: http://mysql.rjweb.org/doc.php/myisam2innodb#fulltext

Your question starts out implying "words", then digresses into cases where "ngram" may be better. If it is "names" (singers, music titles, etc), then ngram is probably not worth considering.

  • Related