I've got a database with production data in multiple tables. I want to analyze the history of the units produced and create a timeline. I am doing this in Python (jupyter lab notebook) and using a cloud based MySQL 8.0 database. Neither of the IDs (both strings and integers) is the primary ID in the database and the IDs cannot be assumed to be sequential. My current strategy is to
- First get the IDs from the first event.
- Do a new query with a WHERE IN [previous IDs] cluase.
- Extract ID's from 2.
- Repeat 2-4 until the final stage.
The IDs are not primary keys in any table. This strategy isn't working as in one stage I have over 800 000 IDs that goes into WHERE IN clause and I can't execute it. Bonus question: Should it work, or is there a limitation in how the query can be formed (such as number of characters or length etc.)?
What I wonder is how to execute this? Is there a way to perform this in a better SQL query or should I split this into multiple queries? Can I use some Python tricks to kind of stream the data in multiple parts?
CodePudding user response:
I have over 800 000 IDs that goes into WHERE IN clause
That's way, way too many for IN
.
The best way to handle this kind of volume is to use a temporary table with CREATE TEMPORARY TABLE
and join the tables instead of using IN
. A temporary table can have an index so that can help speed things up for the join.
This may seem like a very heavy operation but actually it's not; mysql is very good at this kind of thing.