Home > database >  SQL Subquery Join and Sum don't return correct result
SQL Subquery Join and Sum don't return correct result

Time:01-04

I have a table and multiple rows in bigquery. I want to have a report giving for each week the number of people landing on our download page, the number of download and then the percentage.

I developed this sql request

SELECT Q1.week_of_year, #We display the week
       Q1.year, #We display the year
       SUM(page_download_views) as page_download_views,
       SUM(nb_download) as nb_download_per_week,
       (SUM(nb_download) * 100/SUM(page_download_views)) as percentage_of_download #We calculate the percentage between page landing and download
       
FROM

          #We get the number of people clicking on the button download
          (SELECT 
              DISTINCT EXTRACT (WEEK from (PARSE_DATE('%Y%m%d', event_date))) as week_of_year,
              EXTRACT (YEAR from (PARSE_DATE('%Y%m%d', event_date))) as year,
              count(event_name) as nb_download
          FROM `XXXXX-e78a5.analytics_247657392.events_*` WHERE event_name = 'landing_event_download_apk'  GROUP BY event_date) AS Q1

          RIGHT JOIN

          #We get the number of people arriving on the page download
          (SELECT 
              DISTINCT EXTRACT (WEEK from (PARSE_DATE('%Y%m%d', event_date))) as week_of_year,
              count(event_name) as page_download_views
          FROM `XXXXX-e78a5.analytics_247657392.events_*` WHERE event_name = 'page_download_view'  GROUP BY event_date) AS Q2
    
    

ON Q1.week_of_year = Q2.week_of_year GROUP BY week_of_year, year order by year ASC;

But i don't know why, one of the line returned is not correct. enter image description here

The SUM for the week 52, year 2022 instead of returning 529 is returning 1058 as nb_download_per_week and 6138 instead of 3069 for page_download_view

I don't really understand.

When i run a simple sql request test, just to check the number of download views to validate each row, i got the correct number of rows and data, but don't understand why during SUM, I don't have the correct SUM.

enter image description here

What did i do wrong ? Thanks

CodePudding user response:

It seems that the use of right join in the query is resulting in duplication of data. Since you have not provided the sample data, one workaround I can suggest i.e sum before joining that way you will not have data duplication. You can try the below sample query.

Query:

SELECT * from (
(
SELECT
year,
week_of_year,
SUM(nb_download)nb_download #sum of nb_download
FROM
 
(
 #We get the number of people clicking on the button download
 SELECT
   DISTINCT EXTRACT (WEEK from (PARSE_DATE('%Y%m%d', event_date))) as week_of_year,
   EXTRACT (YEAR from (PARSE_DATE('%Y%m%d', event_date))) as year,
   count(event_name) as nb_download
FROM `XXXXX-e78a5.analytics_247657392.events_*` WHERE event_name = 'landing_event_download_apk'  GROUP BY event_date
)group by year,week_of_year
) AS Q1
 
RIGHT JOIN
 
(
 
SELECT year,week_of_year,
sum(page_download_views)page_download_views #sum of page_download_views
from(
 #We get the number of people arriving on the page download
 SELECT
     DISTINCT EXTRACT (WEEK from (PARSE_DATE('%Y%m%d', event_date))) as week_of_year,
     EXTRACT (YEAR from (PARSE_DATE('%Y%m%d', event_date))) as year,
     count(event_name) as page_download_views
 FROM `XXXXX-e78a5.analytics_247657392.events_*` WHERE event_name = 'page_download_view'  GROUP BY event_date
 ) group by year,week_of_year
 
)as Q2
on  Q1.year = Q2.year
and Q1.week_of_year = Q2.week_of_year
)

Above is a sample, please correct syntax if needed. The goal is to either remove duplication after joining the data or join the data after summation.

If this does not help you to resolve your issue please add some sample data so that we can give an exact solution.

  • Related