MySQL using MATCH AGAINST for long unique values (8.0.27)

I have a situation where we're storing long unique IDs (up to 200 characters), each as a single TEXT entry in our database. We're using a FULLTEXT index for speed, and it works great for the smaller GUID-style entries. However, it doesn't work for entries longer than 84 characters because of the limit imposed by innodb_ft_max_token_size, which apparently cannot be set above 84. This means any entry longer than 84 characters is omitted from the index.

Sample Entries (actual data from different sources I need to match):

AQMkADk22NgFmMTgzLTQ3MzEtNDYwYy1hZTgyLTBiZmU0Y2MBNDljMwBGAAADVJvMxLfANEeAePRRtVpkXQcAmNmJjI_T7kK7mrTinXmQXgAAAgENAAAAmNmJjI_T7kK7mrTinXmQXgABYpfCdwAAAA==

AND

<j938ir9r-XfrwkECA8Bxz6iqxVth-BumZCRIQ13On_inEoGIBnxva8BfxOoNNgzYofGuOHKOzldnceaSD0KLmkm9ET4hlomDnLu8PBktoi9-r-pLzKIWbV0eNadC3RIxX3ERwQABAgA=@t2.msgid.quoramail.com>

AND

["ca97826d-3bea-4986-b112-782ab312aq23","ca97826d-3bea-4986-b112-782ab312aaf7","ca97826d-3bea-4986-b112-782ab312a326"]

So what are my options here? Is there any way to get the unique strings of 160 (or so) characters working with a FULLTEXT index?

What's the most efficient Index I can use for large string values without spaces (up to 200 characters)?

CodePudding user response：

Here's a summary of the discussion in comments:

The IDs come in multiple formats: either a single token of variable length up to 200 characters, or an "array", that is, a JSON-formatted document containing multiple tokens. These entries come from different sources, and the format is outside of your control.

The FULLTEXT index implementation in MySQL has a maximum token size of 84 characters, so it cannot index or search for longer tokens.

You could use a conventional B-tree index (not FULLTEXT) to index longer strings, up to 3072 bytes in current versions of MySQL. But this would not support cases of JSON arrays of multiple tokens. You can't use a B-tree index to search for words in the middle of a string. Nor can you use an index with the LIKE predicate to match a substring using a wildcard in the front of the pattern.

Therefore, to use a B-tree index, you must store one token per row. If you receive a JSON array, you would have to split it into individual tokens and store each one in a row by itself. This means writing some code to transform the IDs you receive before inserting them into the database.
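The splitting step could be sketched in Python roughly as follows. This is a minimal sketch, not code from the discussion: the function name `split_ids` is made up, and it assumes a value is a JSON array only when it starts with `[` and parses as a list of strings. Anything else is kept as one opaque token.

```python
import json
from typing import List


def split_ids(raw: str) -> List[str]:
    """Return the individual tokens contained in one received id value."""
    text = raw.strip()
    if text.startswith("["):
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            parsed = None  # looks like an array but is not valid JSON
        if isinstance(parsed, list) and all(isinstance(t, str) for t in parsed):
            return parsed
    # Assumption: every other format is a single token with no internal structure.
    return [text]
```

Each returned token would then be inserted as its own row, so that a plain B-tree index can serve exact-match lookups.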

MySQL 8.0.17 and later support a new kind of index on a JSON array, called a multi-valued index. If you could store all your tokens as a JSON array, even those that are received as single tokens, you could use this type of index. But this would also require writing some code to transform the singular form of the IDs into a JSON array.
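As a rough illustration of that transformation, the Python sketch below wraps every received value in a JSON document whose `ids` key holds an array. Everything named here is an assumption for illustration: the table `external_ids`, the column `doc`, the `ids` key, and the function `to_json_doc`. The SQL strings are untested and only follow the general shape of the multi-valued index example in the MySQL manual (a JSON path cast to an `ARRAY`, queried with `MEMBER OF`). The `%s` placeholder assumes a driver such as PyMySQL or MySQL Connector/Python.

```python
import json
from typing import List

# Hypothetical schema: a JSON column with a multi-valued index on doc->'$.ids'.
CREATE_TABLE_SQL = """
CREATE TABLE external_ids (
  id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  doc JSON NOT NULL,
  INDEX idx_tokens ((CAST(doc->'$.ids' AS CHAR(200) ARRAY)))
)
"""

# The WHERE clause uses the same path expression as the index definition.
LOOKUP_SQL = "SELECT id FROM external_ids WHERE %s MEMBER OF (doc->'$.ids')"


def to_json_doc(raw: str) -> str:
    """Wrap a received id (single token or JSON array) as {"ids": [...]}."""
    text = raw.strip()
    tokens: List[str] = [text]
    if text.startswith("["):
        try:
            parsed = json.loads(text)
            if isinstance(parsed, list) and all(isinstance(t, str) for t in parsed):
                tokens = parsed
        except json.JSONDecodeError:
            pass  # not valid JSON: keep the value as one opaque token
    return json.dumps({"ids": tokens})
```

The string returned by `to_json_doc` is what would be stored in the JSON column, so single tokens and arrays end up in the same indexable shape.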

The bottom line is that there is no single solution for indexing the text if you must support any and all formats. You either have to suffer with non-optimized searches, or else you need to find a way to modify the data so you can index it.

CodePudding user response：

Create a new table with 2 columns: a VARCHAR(200) CHARACTER SET ascii COLLATE ascii_bin column for the long string (Base64 needs case sensitivity), and, presumably, the id of the matching row in your main table, which the join needs.
That table may have multiple rows for one row in your main table.
Use some simple parsing to find the string (or strings) in each row of your main table and add them to this new table.
PRIMARY KEY(that-big-column)
Update your code to also INSERT rows into the new table whenever new data arrives.

Now a simple B-tree lookup plus a JOIN will cover all of your cases.

A TEXT column cannot be indexed in full by a B-tree index (only with a prefix length), but VARCHAR up to some limit can. VARCHAR(200) with ascii is only 200 bytes, well below the 3072-byte limit.
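One way to picture this side-table approach is the Python sketch below. All names are hypothetical (`id_tokens`, `main_table`, `main_id`, `token_rows`), the SQL strings are untested illustrations of the layout described above, and the `%s` placeholders assume a driver such as PyMySQL or MySQL Connector/Python.

```python
import json
from typing import List, Tuple

# Hypothetical side table: one row per token, pointing back at the main table.
CREATE_TOKEN_TABLE_SQL = """
CREATE TABLE id_tokens (
  token   VARCHAR(200) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  main_id BIGINT UNSIGNED NOT NULL,
  PRIMARY KEY (token)
)
"""

INSERT_SQL = "INSERT INTO id_tokens (token, main_id) VALUES (%s, %s)"

# Exact-match lookup on the primary key, then a join back to the main row.
LOOKUP_SQL = """
SELECT m.*
FROM id_tokens AS t
JOIN main_table AS m ON m.id = t.main_id
WHERE t.token = %s
"""


def token_rows(main_id: int, raw: str) -> List[Tuple[str, int]]:
    """Build the (token, main_id) parameter tuples for INSERT_SQL."""
    text = raw.strip()
    tokens = [text]
    if text.startswith("["):
        try:
            parsed = json.loads(text)
            if isinstance(parsed, list) and all(isinstance(t, str) for t in parsed):
                tokens = parsed
        except json.JSONDecodeError:
            pass  # not a JSON array after all: keep one opaque token
    for token in tokens:
        # The ascii column cannot hold other characters, and VARCHAR(200) caps the length.
        if not token.isascii() or len(token) > 200:
            raise ValueError(f"token does not fit VARCHAR(200) ascii: {token!r}")
    return [(token, main_id) for token in tokens]
```

The tuples from `token_rows` could be passed to the driver's `executemany` together with `INSERT_SQL`. Because `token` is the primary key, this layout relies on each token being unique across the whole table, as the question states.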