I have a MySQL database, in which I need to store variable-sized floating-point arrays. These arrays are profile measurements made from a metrology tool (an ellipsometer to be specific) in a lab. The number of points per array varies depending on what recipe the user used: the number of points currently ranges from about 5 to 199 (the largest I've seen), but the max value could increased if a user made a recipe in the future that measured more points. Most of the arrays are 49 points (about 70%), and only 0.04% used 199 points.
Current, we are storing the arrays in a MySQL database as a as a comma-separated string as TEXT datatype. But retrieval requires parsing and typecasting, which makes it very slow. It also takes up about twice as much space as a 32bit binary array would. My question is what is the best way to store this data? I do not need to search or sort or join-to this column: just retrieve it alongside the rest of the data that goes with the measurement.
CodePudding user response:
There's a SQLish way of doing this, and it does not involve commas or BLOBs.
Create a datapoints
table with three columns:
- datapoint_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL
- measurement_id BIGINT UNSIGNED
- datapoint FLOAT ( or DOUBLE if you need it)
Then store your sequences of datapoints into the table, in order, in multiple rows. Each distinct measurement gets its own measurement_id value.
You can retrieve a measurement of any length with
SELECT datapoint
FROM datapoints
WHERE measurement_id = whatever
ORDER BY datapoint_id;
This may seem like too many rows to store. But, SQL is optimized really well for this sort of use pattern.
Creating this index will make things really fast.
CREATE INDEX points ON datapoints
(measurement_id, datapoint_id, datapoint);
And, you can make each measurement contain as few or as many datapoints as needed.
Edit. With this structure you can use SQL to get all sorts of descriptive statistics really easily.
SELECT COUNT(*),
MAX(datapoint),
MIN(datapoint),
AVG(datapoint),
STDEV(datapoint)
FROM datapoints
GROUP BY measurement_id;
The index I suggested will accelerate that operation too.
If you really do need arrays, consider postgresql in place of Mysql. It has them in its language.
CodePudding user response:
All FLOATS in a single table
A single table with about 10K rows is trivial for a database to handle. It would have at least the FLOAT and something like recipe_id. (See OJones's Answer for more discussion.)
JSON
Most languages can handle JSON. With json_decode() and json_encode() (or however they are spelled), you can easily turn an array into a sting and vice versa. The database version may have a JSON
datatype or simply use TEXT
.
BLOB
A BLOB gets trickier because of the typecasting that would be required; it is easier in old languages like C / C , but harder in newer languages. You mentioned JMP as the datasource, but not the language you are using.