Home > Back-end >  Best way to perform unstructured text search in MySQL using FULLTEXT Index
Best way to perform unstructured text search in MySQL using FULLTEXT Index

Time:01-11

I have a large MySQL table with 20mm records of sales and 10 or so columns. Two of the columns are POSTCODE and STREET. I have configured the table so that i have a FULLTEXT index defined on these 2 columns. My problem is when I search this the table using the following query, I get a bunch of random results where I expect to get much closer matches... for example:

SELECT * FROM table1 WHERE MATCH(postcode, street) AGAINST ('W4 RAVEN' IN NATURAL LANGUAGE MODE)

Yields a bunch of results (see top 5 below)

postcode street
WS5 3PZ RAVEN ROAD
WD18 7DA RAVEN CLOSE
DE55 5NR RAVEN AVENUE
IP28 7LF RAVEN CLOSE
WD18 7DB RAVEN CLOSE

When in fact, if I run this query, I get the below result, which looks like a much more relevant result than the above to me... ?

SELECT postcode, street  from table1 where postcode LIKE "W4 5EQ%" LIMIT 5
postcode street
W4 5EQ RAVENSCROFT ROAD
W4 5EQ RAVENSCROFT ROAD
W4 5EQ RAVENSCROFT ROAD
W4 5EQ RAVENSCROFT ROAD
W4 5EQ RAVENSCROFT ROAD

Am I misunderstanding how the search works?

I would expect both of the queries above to return very similar results. The reason I want to use the first one though is because sometimes the user may just search RAVEN and sometimes just W4, so I dont want to limit them to the column type.

Edit to clarify: Given my table has records similar to what the second query is giving, I would have expected the first query to give something more similar to that than what it returned (a 'closer' match)

CodePudding user response:

I'm sorry to write that comparing the two queries' results is like comparing apples with pears!

In the first query you search for a part of a postcode and a part of the street name. Not sure why you are surprised that it returns rows where the searched string appears in the street name column! Not to mention the fact that W4 is probably ignored as it is too short for a word. The default minimum word length is 3 for innodb tables. If you reduced this limit to 2 and rebuilt the fulltext index, then your search results coud also be more accurate.

In the second query you search for a specific postcode only, therefore your results will be more specific as well.

Please remember, fulltext search does not behave like google search! It does not use machine learning or similar data scince techniques to guess the relationship between different parts of the searched expression!

CodePudding user response:

  • I suspect that IN NATURAL MODE is inappropriate for addresses. Suggest IN BOOLEAN MODE, perhaps with plus ( ) in front of each term.
  • The default innodb_ft_min_token_size value is 3. Hence, "W4" will be ignored. You could change that, however, it would mean rebuilding all FT indexes.
  • I suggest having a space in the postcode is counter-productive. As you insert data, remove all punctuation from it. Then search for W4*.
  • For a query that only checks prefixes, `LIKE 'W4%' may be faster than FT.
  • Mixing FT searching and non-FT searching may be slow.
  • Related