I'm attempting to use SQL to pull data from a database into a Jupyter (python) notebook and work with it there. I have a query that pulls the yearweek of flight's upload date, and counts the number of flights in that yearweek. Finally, it groups the results by the yearweek of upload date:
SELECT YEARWEEK(d.upload_date), COUNT(f.id)
FROM apps_flight f
LEFT JOIN apps_enginedatafile d ON d.id=f.import_file_id
WHERE f.global_duplicate = 0
GROUP BY YEARWEEK(d.upload_date)
I want to count number of subscribers (located in another table) from each yearweek to compare them to count of flights. So I'm trying to join said table by adding:
LEFT JOIN apps_subscription s ON s.basesubscription_ptr_id = f.id
But, when I do this, the counts of my flight values change! The first few counts for the original query look like:
[327, 605, 78, 5768, 9716, 9686, 7902, 3699, 3323, 6081, 4966, 3456, 3181, 2749, 4577, 3157, 1792, 1806, ...]
After joining the table, it becomes:
[327, 738, 78, 8854, 17418, 16156, 13921, 7536, 5380, 10040, 7559, 5461, 6323, 6412, 6702, 5433, 2924, ...]
I'm not sure what's happening here. Perhaps the join is creating duplicate rows? The data set is very large, and takes about 30 minutes to run the query. Adding a LIMIT doesn't seem to speed it up, so as you can imagine, testing takes a little while. (If I'm oblivious to another way to speed up the query aside from a LIMIT, feel free to make me aware)!
Thanks for any info.
CodePudding user response:
Simply join two aggregate count queries. Below assumes same structure including columns names. (Adjust upload_date
to actual date/time column in apps_subscription
.)
WITH agg_flights AS (
SELECT YEARWEEK(d.upload_date) AS year_week,
COUNT(f.id) AS flight_counts
FROM apps_flight f
LEFT JOIN apps_enginedatafile d
ON d.id = f.import_file_id
WHERE f.global_duplicate = 0
GROUP BY YEARWEEK(d.upload_date)
), agg_subs AS (
SELECT YEARWEEK(s.upload_date) AS year_week, -- ADJUST date/time variable
COUNT(f.id) AS subscriber_counts
FROM apps_flight f
LEFT JOIN apps_subscription s
ON s.basesubscription_ptr_id = f.id
WHERE f.global_duplicate = 0
GROUP BY YEARWEEK(s.upload_date) -- ADJUST date/time variable
)
SELECT f.year_week,
f.flight_counts,
s.subscriber_counts
FROM agg_flights f
INNER JOIN agg_subs s
ON f.year_week = s.year_week
CodePudding user response:
Joins create combined rows of all the tables joined. So your join between f and d will have multiple rows (before the group by) for a single flight if that flight has more than one import_file_id value, and the join on s will add multiple rows if a flight has more than one subscription. And COUNT operates on the result of the joins, not on the f table before the join.
In this case, the easy fix is to just use COUNT(DISTINCT f.id)
instead of COUNT(f.id)
, so each flight is only counted once per yearweek.