SQL Join is changing values of my existing column-CodePudding

I'm attempting to use SQL to pull data from a database into a Jupyter (python) notebook and work with it there. I have a query that pulls the yearweek of flight's upload date, and counts the number of flights in that yearweek. Finally, it groups the results by the yearweek of upload date:

SELECT YEARWEEK(d.upload_date), COUNT(f.id)
FROM apps_flight f
LEFT JOIN apps_enginedatafile d ON d.id=f.import_file_id
WHERE f.global_duplicate = 0
GROUP BY YEARWEEK(d.upload_date)

I want to count number of subscribers (located in another table) from each yearweek to compare them to count of flights. So I'm trying to join said table by adding:

LEFT JOIN apps_subscription s ON s.basesubscription_ptr_id = f.id

But, when I do this, the counts of my flight values change! The first few counts for the original query look like:

[327, 605, 78, 5768, 9716, 9686, 7902, 3699, 3323, 6081, 4966, 3456, 3181, 2749, 4577, 3157, 1792, 1806, ...]

After joining the table, it becomes:

[327, 738, 78, 8854, 17418, 16156, 13921, 7536, 5380, 10040, 7559, 5461, 6323, 6412, 6702, 5433, 2924, ...]

I'm not sure what's happening here. Perhaps the join is creating duplicate rows? The data set is very large, and takes about 30 minutes to run the query. Adding a LIMIT doesn't seem to speed it up, so as you can imagine, testing takes a little while. (If I'm oblivious to another way to speed up the query aside from a LIMIT, feel free to make me aware)!

Thanks for any info.

CodePudding user response：

Simply join two aggregate count queries. Below assumes same structure including columns names. (Adjust upload_date to actual date/time column in apps_subscription.)

WITH agg_flights AS (
    SELECT YEARWEEK(d.upload_date) AS year_week, 
           COUNT(f.id) AS flight_counts
    FROM apps_flight f
    LEFT JOIN apps_enginedatafile d  
         ON d.id = f.import_file_id
    WHERE f.global_duplicate = 0
    GROUP BY YEARWEEK(d.upload_date)
), agg_subs AS (
    SELECT YEARWEEK(s.upload_date) AS year_week,  -- ADJUST date/time variable
           COUNT(f.id) AS subscriber_counts
    FROM apps_flight f
    LEFT JOIN apps_subscription s 
         ON s.basesubscription_ptr_id = f.id
    WHERE f.global_duplicate = 0
    GROUP BY YEARWEEK(s.upload_date)              -- ADJUST date/time variable
)

SELECT f.year_week, 
       f.flight_counts,
       s.subscriber_counts
FROM agg_flights f
INNER JOIN agg_subs s
  ON f.year_week = s.year_week

CodePudding user response：

Joins create combined rows of all the tables joined. So your join between f and d will have multiple rows (before the group by) for a single flight if that flight has more than one import_file_id value, and the join on s will add multiple rows if a flight has more than one subscription. And COUNT operates on the result of the joins, not on the f table before the join.

In this case, the easy fix is to just use COUNT(DISTINCT f.id) instead of COUNT(f.id), so each flight is only counted once per yearweek.