I have a table and multiple rows in bigquery. I want to have a report giving for each week the number of people landing on our download page, the number of download and then the percentage.
I developed this sql request
SELECT Q1.week_of_year, #We display the week
Q1.year, #We display the year
SUM(page_download_views) as page_download_views,
SUM(nb_download) as nb_download_per_week,
(SUM(nb_download) * 100/SUM(page_download_views)) as percentage_of_download #We calculate the percentage between page landing and download
FROM
#We get the number of people clicking on the button download
(SELECT
DISTINCT EXTRACT (WEEK from (PARSE_DATE('%Y%m%d', event_date))) as week_of_year,
EXTRACT (YEAR from (PARSE_DATE('%Y%m%d', event_date))) as year,
count(event_name) as nb_download
FROM `XXXXX-e78a5.analytics_247657392.events_*` WHERE event_name = 'landing_event_download_apk' GROUP BY event_date) AS Q1
RIGHT JOIN
#We get the number of people arriving on the page download
(SELECT
DISTINCT EXTRACT (WEEK from (PARSE_DATE('%Y%m%d', event_date))) as week_of_year,
count(event_name) as page_download_views
FROM `XXXXX-e78a5.analytics_247657392.events_*` WHERE event_name = 'page_download_view' GROUP BY event_date) AS Q2
ON Q1.week_of_year = Q2.week_of_year GROUP BY week_of_year, year order by year ASC;
But i don't know why, one of the line returned is not correct.
The SUM for the week 52, year 2022 instead of returning 529
is returning 1058
as nb_download_per_week and 6138
instead of 3069
for page_download_view
I don't really understand.
When i run a simple sql request test, just to check the number of download views to validate each row, i got the correct number of rows and data, but don't understand why during SUM, I don't have the correct SUM.
What did i do wrong ? Thanks
CodePudding user response:
It seems that the use of right join in the query is resulting in duplication of data. Since you have not provided the sample data, one workaround I can suggest i.e sum before joining that way you will not have data duplication. You can try the below sample query.
Query:
SELECT * from (
(
SELECT
year,
week_of_year,
SUM(nb_download)nb_download #sum of nb_download
FROM
(
#We get the number of people clicking on the button download
SELECT
DISTINCT EXTRACT (WEEK from (PARSE_DATE('%Y%m%d', event_date))) as week_of_year,
EXTRACT (YEAR from (PARSE_DATE('%Y%m%d', event_date))) as year,
count(event_name) as nb_download
FROM `XXXXX-e78a5.analytics_247657392.events_*` WHERE event_name = 'landing_event_download_apk' GROUP BY event_date
)group by year,week_of_year
) AS Q1
RIGHT JOIN
(
SELECT year,week_of_year,
sum(page_download_views)page_download_views #sum of page_download_views
from(
#We get the number of people arriving on the page download
SELECT
DISTINCT EXTRACT (WEEK from (PARSE_DATE('%Y%m%d', event_date))) as week_of_year,
EXTRACT (YEAR from (PARSE_DATE('%Y%m%d', event_date))) as year,
count(event_name) as page_download_views
FROM `XXXXX-e78a5.analytics_247657392.events_*` WHERE event_name = 'page_download_view' GROUP BY event_date
) group by year,week_of_year
)as Q2
on Q1.year = Q2.year
and Q1.week_of_year = Q2.week_of_year
)
Above is a sample, please correct syntax if needed. The goal is to either remove duplication after joining the data or join the data after summation.
If this does not help you to resolve your issue please add some sample data so that we can give an exact solution.