How to set up MYSQL Tables for fast SELECT-CodePudding

The question is about *.FIT files (link to definition) (1 to extremely many and constantly more), from Sports watches, speedometers, in which there is always a timestamp (1 to n seconds), as well as 1 to n further parameters (which also have either a timestamp or a counter from 1 to x).

To perform data analysis, I need the data in the database to calculate e.g. the heart rates in relation to the altitude over several FIT files / training units / time periods.

Because of the changing number of parameters in a FIT file (depending on the connected devices, the device that created the file, etc.) and the possibility to integrate more/new parameters in the future, my idea was to have a separate table for each parameter instead of writing everything in one big table (which would then have extremely many "empty" cells whenever a parameter is not present in a FIT file).

Basic tables:

1 x tbl_file

id	filename	date
1	xyz.fit	2022-01-01
2	vwx.fit	2022-01-02
..	..	..

n x tbl_parameter_xy / tbl_ parameter_yz / ....

id	timestamp/counter	file_id	value
1	0	1	value
2	1	1	value
3	0	2	value
..	..	..	..

And these parameter tables would then be linked to each other via the file_id as well as to the FIT File.

I then used a test server, set up a MYSQL-DB to test this and was shocked:

SELECT * FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
WHERE x.file_id = 999

Took almost 30 seconds to give me the results.

In my parameter tables there are 209918 rows. file_id 999 consists of 1964 rows. But my SELECT with JOIN returns 3857269 rows, so there must be an/the error and that's the reason why it takes 30sec.

In comparison, fetching from a "large complete" table was done in 0.5 seconds:

SELECT * FROM tbl_all_parameters 
WHERE file_id = 999

After some research, I came across INDEX and thought I had the solution. I created an index (file_id) for each of the parameter tables, but the result was even slower/same.

Right now I´m thinking about building that big "one in all" table, which makes it easier to handle and faster to select from, but I would have to update it frequently to insert new cols for new parameters. And I´m afraid it will grow so big it kills itself

I have 2 questions:

Which table setup is recommended, primary with focus on SELECT speed, secondary with size of DB.
Do I have a basic bug in my SELECT that makes it so slow?

EXPLAIN SELECT

CodePudding user response：

You're getting a combinatorial explosion in your JOIN. Your result set contains one output row for every pair of input rows in your two parameter tables.

If you say

  SELECT * FROM a LEFT JOIN b

with no ON condition at all you get COUNT(a) * COUNT(b) rows in your result set. And you said this

  SELECT * FROM a LEFT JOIN b WHERE a.file_id = b.file_id

which gives you a similarly bloated result set.

You need another ON condition... possibly try this.

SELECT * 
  FROM tbl_parameter_xy as x
  LEFT JOIN tbl_parameter_yz as y
         ON x.file_id = y.file_id
        AND x.timestamp = y.timestamp

if the timestamps in the two tables are somehow in sync.

But, with respect, I don't think you have a very good database design yet. This is a tricky kind of data for which to create an optimal database layout, because it's extensible.

If you find yourself with a design where you routinely create new tables in production (for example, when adding a new device type) you almost certainly have misdesigned you database.

An approach you might take is creating an attribute / value table. It will have a lot of rows in it, but they'll be short and easy to index.

Your observations will go into a table like this.

file_id       part of your primary key 
parameter_id  part of your primary key
timestamp     part of your primary key
value

Then, when you need to, say, retrieve parameters 2 and 3 from a particular file, you would do

    SELECT timestamp, parameter_id, value
      FROM observation_table
     WHERE file_id = xxxx
       AND parameter_id IN (2,3)
     ORDER BY timestamp, parameter_id

The multicolumn primary key I suggested will optimize this particular query.

Once you have this working, read about denormalization.