Edit 1: so the issue is '<=' is acting as '<' in google query which is strange. But '>=' acts normally. Any idea why this is happening?
Goal: to get data for May 2019.
Query 2 uses timestamp >= '2019-05-01' AND timestamp <= '2019-05-31'
SELECT file.project AS package, COUNT(file.project) AS installs, FORMAT_DATETIME('%Y-%m', timestamp) AS month
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp >= '2019-05-01' AND timestamp <= '2019-05-31'
GROUP BY month, package;
Both query one and two should scan same amount of data - May 2019 but both query gives different results and scans different amount of data as you can see in attached images.
Which one is correct and why both are not matching?
CodePudding user response:
The two filters are different, you can simply check the difference in the result by the below script.
Differences
SELECT timestamp, FORMAT_DATETIME('%Y-%m', timestamp) AS month
FROM `bigquery-public-data.pypi.file_downloads`
WHERE
timestamp > '2019-04-30' AND timestamp < '2019-06-01'
AND NOT (timestamp >= '2019-05-01' AND timestamp <= '2019-05-31')
;
Results
Personal Preference
SELECT file.project AS package, COUNT(file.project) AS installs, FORMAT_DATETIME('%Y-%m', timestamp) AS month
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN '2019-05-01' AND '2019-05-31'
p.s. As you can check out in the
SELECT EXTRACT(MONTH FROM timestamp) month, COUNT(1) cnt
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN '2019-05-01' AND '2019-06-01' -- scan 22.57 GB
GROUP BY 1