The question is about *.FIT files (link to definition) (1 to extremely many and constantly more), from Sports watches, speedometers, in which there is always a timestamp (1 to n seconds), as well as 1 to n further parameters (which also have either a timestamp or a counter from 1 to x).
To perform data analysis, I need the data in the database to calculate e.g. the heart rates in relation to the altitude over several FIT files / training units / time periods.
Because of the changing number of parameters in a FIT file (depending on the connected devices, the device that created the file, etc.) and the possibility to integrate more/new parameters in the future, my idea was to have a separate table for each parameter instead of writing everything in one big table (which would then have extremely many "empty" cells whenever a parameter is not present in a FIT file).
Basic tables:
1 x tbl_file
id | filename | date |
---|---|---|
1 | xyz.fit | 2022-01-01 |
2 | vwx.fit | 2022-01-02 |
.. | .. | .. |
n x tbl_parameter_xy / tbl_ parameter_yz / ....
id | timestamp/counter | file_id | value |
---|---|---|---|
1 | 0 | 1 | value |
2 | 1 | 1 | value |
3 | 0 | 2 | value |
.. | .. | .. | .. |
And these parameter tables would then be linked to each other via the file_id as well as to the FIT File.
I then used a test server, set up a MYSQL-DB to test this and was shocked:
SELECT * FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
WHERE x.file_id = 999
Took almost 30 seconds to give me the results.
In my parameter tables there are 209918 rows. file_id 999 consists of 1964 rows. But my SELECT with JOIN returns 3857269 rows, so there must be an/the error and that's the reason why it takes 30sec.
In comparison, fetching from a "large complete" table was done in 0.5 seconds:
SELECT * FROM tbl_all_parameters
WHERE file_id = 999
After some research, I came across INDEX and thought I had the solution. I created an index (file_id) for each of the parameter tables, but the result was even slower/same.
Right now I´m thinking about building that big "one in all" table, which makes it easier to handle and faster to select from, but I would have to update it frequently to insert new cols for new parameters. And I´m afraid it will grow so big it kills itself
I have 2 questions:
- Which table setup is recommended, primary with focus on SELECT speed, secondary with size of DB.
- Do I have a basic bug in my SELECT that makes it so slow?
CodePudding user response:
You're getting a combinatorial explosion in your JOIN. Your result set contains one output row for every pair of input rows in your two parameter tables.
If you say
SELECT * FROM a LEFT JOIN b
with no ON condition at all you get COUNT(a) * COUNT(b) rows in your result set. And you said this
SELECT * FROM a LEFT JOIN b WHERE a.file_id = b.file_id
which gives you a similarly bloated result set.
You need another ON condition... possibly try this.
SELECT *
FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
AND x.timestamp = y.timestamp
if the timestamps in the two tables are somehow in sync.
But, with respect, I don't think you have a very good database design yet. This is a tricky kind of data for which to create an optimal database layout, because it's extensible.
If you find yourself with a design where you routinely create new tables in production (for example, when adding a new device type) you almost certainly have misdesigned you database.
An approach you might take is creating an attribute / value table. It will have a lot of rows in it, but they'll be short and easy to index.
Your observations will go into a table like this.
file_id part of your primary key
parameter_id part of your primary key
timestamp part of your primary key
value
Then, when you need to, say, retrieve parameters 2 and 3 from a particular file, you would do
SELECT timestamp, parameter_id, value
FROM observation_table
WHERE file_id = xxxx
AND parameter_id IN (2,3)
ORDER BY timestamp, parameter_id
The multicolumn primary key I suggested will optimize this particular query.
Once you have this working, read about denormalization.