Optimizing a simple query with 70mil rows to fit into Tableau


Newbie to SQL. I have a simple query here that returns 70 million rows, and my work laptop can't handle that volume when I import it into Tableau. Usually 20 million rows or fewer work fine. Here's my problem.

Table name: Table1

Fields: UniqueID, State, Date, claim_type

Query:

SELECT uniqueID, states, claim_type, date
FROM table1
WHERE date >= '11-09-2021'

This gives me what I want, BUT I can shrink the result set significantly if I keep only the uniqueIDs that have been used in 3 or more different states. I use this query to do that:

SELECT unique_id, count(distinct states), claim_type, date
FROM table1
WHERE date >= '11-09-2021'
GROUP BY unique_id, claim_type, date
HAVING COUNT(DISTINCT states) > 3

The only issue is that when I put this query into Tableau, it only displays the FIRST state a unique_id showed up in, and the first date it showed up. A unique_id shows up in multiple states over multiple dates, so this count aggregation only gives me the first result, not the whole picture.

Any ideas here? I'm totally lost and have spent a whole business day trying to fix this.

Expected output would be something like

uniqueID | state    | claim_type | Date
123      | Ohio     | C          | 01-01-2021
123      | Nebraska | I          | 02-08-2021
123      | Georgia  | D          | 03-08-2021

CodePudding user response:

Yes, you are grouping rows, so you 'lose' information in the grouped result. You won't get 70m records with your grouped query.
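To make the lost-information point concrete, here is a minimal sketch using SQLite in Python (the table and values are invented for illustration; your real database may be MySQL, but the grouping behavior is the same idea):

```python
import sqlite3

# Toy version of table1 with one uniqueID appearing in three states.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table1 (uniqueID, states, claim_type, date)")
cur.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [(123, "Ohio",     "C", "2021-01-01"),
     (123, "Nebraska", "I", "2021-02-08"),
     (123, "Georgia",  "D", "2021-03-08")],
)

# Raw query: one row per claim, all three states visible.
raw = cur.execute("SELECT uniqueID, states FROM table1").fetchall()
print(len(raw))  # 3

# Grouped query: one row per group, so the per-state detail collapses.
grouped = cur.execute(
    "SELECT uniqueID, COUNT(DISTINCT states) FROM table1 GROUP BY uniqueID"
).fetchall()
print(grouped)  # [(123, 3)] -- a single row; two of the states are gone
```

That single grouped row is why Tableau only shows one state/date per unique_id.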

Why don't you split your import into smaller chunks? For example, limit the rows to chunks of, say, 15 million:

1st:

SELECT uniqueID, states, claim_type, date FROM table1 WHERE date >= '11-09-2021' LIMIT 15000000;

2nd:

SELECT uniqueID, states, claim_type, date FROM table1 WHERE date >= '11-09-2021' LIMIT 15000000 OFFSET 15000000;

3rd:

SELECT uniqueID, states, claim_type, date FROM table1 WHERE date >= '11-09-2021' LIMIT 15000000 OFFSET 30000000;

and so on..

I know it's not a perfect or very handy solution, but maybe it gets you to the desired outcome.

See this link for info about LIMIT and OFFSET: https://www.bitdegree.org/learn/mysql-limit-offset
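One caveat worth adding to this approach: without an ORDER BY, SQL makes no guarantee that rows come back in the same order on each query, so LIMIT/OFFSET chunks can overlap or miss rows. Adding a deterministic ORDER BY (here on uniqueID, assuming it sorts distinctly enough) makes the chunks tile the full result. A small SQLite-in-Python sketch with toy data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table1 (uniqueID, states, claim_type, date)")
cur.executemany(
    "INSERT INTO table1 VALUES (?, 'Ohio', 'C', '2021-12-01')",
    [(i,) for i in range(10)],
)

# Page through the table in chunks of 4, ordered so pages are stable.
chunk, seen = 4, []
for offset in range(0, 10, chunk):
    rows = cur.execute(
        "SELECT uniqueID FROM table1 "
        "WHERE date >= '2021-11-09' "
        "ORDER BY uniqueID LIMIT ? OFFSET ?",
        (chunk, offset),
    ).fetchall()
    seen.extend(r[0] for r in rows)

print(seen)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] -- every row exactly once
```

Without the ORDER BY, nothing stops the second chunk from repeating rows the first chunk already returned.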

CodePudding user response:

If your table has only those four columns, and your queries are based on date ranges, you need an index to optimize that. If 70 mil records exist, how far back does that go... years? If your data since 2021-09-11 is only, say, 30k records, that should be all you are scanning through for your results.

I would ensure you have an index based on (and in this order) (date, uniqueID, claim_type, states). Also, you mentioned you wanted a count of 3 OR MORE: your query's > 3 will return only IDs seen in 4 or more states unless you change it to COUNT(DISTINCT states) >= 3.
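A sketch of creating that index, shown here with SQLite via Python just to illustrate the shape (the CREATE INDEX statement is the same idea in MySQL). Because the index leads with date and covers every selected column, EXPLAIN QUERY PLAN shows the date-range query served from the covering index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table1 (uniqueID, states, claim_type, date)")

# Covering index: date first for the range filter, then the grouped columns.
cur.execute(
    "CREATE INDEX idx_date_id_claim_states "
    "ON table1 (date, uniqueID, claim_type, states)"
)

plan = cur.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT date, uniqueID, claim_type FROM table1 "
    "WHERE date >= '2021-09-11' "
    "GROUP BY date, uniqueID, claim_type"
).fetchall()
print(plan)  # the plan detail should mention idx_date_id_claim_states
```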

Then, to get the entries you care about, you need

SELECT date, uniqueID, claim_type
   FROM table1
   WHERE date >= '2021-09-11'
   GROUP BY date, uniqueID, claim_type
   HAVING COUNT(DISTINCT states) >= 3

This gives just the date/ID/claim combinations that qualified. Then you use THIS result set to get the other entries via

SELECT DISTINCT
      t1.date, t1.uniqueID, t1.claim_type, t1.states
   FROM
      ( SELECT date, uniqueID, claim_type
           FROM table1
           WHERE date >= '2021-09-11'
           GROUP BY date, uniqueID, claim_type
           HAVING COUNT(DISTINCT states) >= 3 ) PQ
         JOIN table1 t1
            ON PQ.date = t1.date
           AND PQ.uniqueID = t1.uniqueID
           AND PQ.claim_type = t1.claim_type

The "PQ" (pre-query) gets the qualified records. Then it joins back to the original table, grabs every row that matches a qualified date/uniqueID/claim_type combination, and returns all of their states. (Note the outer SELECT qualifies its columns with t1; leaving date, uniqueID, and claim_type unqualified would be ambiguous, since both PQ and t1 have them.)
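A runnable sketch of that join-back pattern, using SQLite in Python with invented toy data. One thing to notice: because the pre-query groups by date and claim_type as well, an ID only qualifies when its 3+ distinct states share the same date and claim_type, so the toy data puts them in one group:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table1 (uniqueID, states, claim_type, date)")
cur.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [  # 123 hits three states in one date/claim group -> qualifies
       (123, "Ohio",     "C", "2021-09-15"),
       (123, "Nebraska", "C", "2021-09-15"),
       (123, "Georgia",  "C", "2021-09-15"),
       # 456 appears in only one state -> filtered out
       (456, "Texas",    "I", "2021-09-20")],
)

rows = cur.execute("""
    SELECT DISTINCT t1.date, t1.uniqueID, t1.claim_type, t1.states
    FROM ( SELECT date, uniqueID, claim_type
             FROM table1
             WHERE date >= '2021-09-11'
             GROUP BY date, uniqueID, claim_type
             HAVING COUNT(DISTINCT states) >= 3 ) PQ
    JOIN table1 t1
      ON PQ.date = t1.date
     AND PQ.uniqueID = t1.uniqueID
     AND PQ.claim_type = t1.claim_type
""").fetchall()

for r in rows:
    print(r)  # three rows for 123, one per state; 456 does not appear
```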
