I'm quite a newbie in bigquery. I have a table MOVIES
with the following schema :
[SchemaField('movieId', 'INTEGER', 'NULLABLE', None, ()),
SchemaField('title', 'STRING', 'NULLABLE', None, ()),
SchemaField('genres', 'STRING', 'NULLABLE', None, ())]
The title of the movies looks like this : Pharaoh's Army (1995)
I want to see in which year most drama were filmed. I already managed to extract the year from the title and only keeping movies that are dramas like so :
q4 = """
select
movies.title,
regexp_extract(title, r'\((\d{4})\)') as year,
from assignment-344206.movie_lens_20M.MOVIES movies
where movies.genres = "Drama"
order by year desc
limit 5
"""
query_job_4 = client.query(q4)
query_job_4.to_dataframe()
Now I need to group by year and then count each group to find the year with the most dramas. I tried this :
q4 = """
select
movies.title,
regexp_extract(title, r'\((\d{4})\)') as year,
count(year) as nb_per_year
from assignment-344206.movie_lens_20M.MOVIES movies
where movies.genres = "Drama"
group by year
order by year desc
limit 5
"""
But I get the following error BadRequest: 400 Unrecognized name: year at [5:9]
which I don't understand. Without the group by
the name year
was recognized in order by year desc
so why doesn't it recognize it now ?
CodePudding user response:
To complement Martin Weitzmann answer, if you decide to go using WITH
, below its a code/script implementation of such scenario. with
can be use as a temp table in a single expression for your final query. For more details about with
behaviour can be found on with clause page.
query: This query shows the usage of with
on your case. You can run this on the BigQuery UI.
with movies as (
select
title,
SAFE_CAST(regexp_extract(title, r'\((\d{4})\)') AS INT64) as year,
genres
from `projectid.datasetid.table`
)
select count(1) as movies_per_year, year
from movies
where genres = "Drama"
group by year
order by year desc
code: Replace QUERY with above query
from google.cloud import bigquery
client = bigquery.Client()
query = """ QUERY """
query_job = client.query(query)
df = query_job.to_dataframe()
print(df.head())
output
year | movier per year
2008 | 1
2003 | 1
2001 | 1
1994 | 2
1993 | 1
...
CodePudding user response:
The aggregator COUNT()
can't know year
because you're only creating it in the line above as a category/group. ORDER BY
gets executed after grouping/aggregations so it can know the field year
.
Rough order of execution is afaik
- FROM input tables
- JOINs on table records
- WHERE fields filter
- GROUP BY aggregate fields
- HAVING aggregates filter
- WINDOW functions
- QUALIFY window fields filter
- DISTINCT result fields
- ORDER BY result fields
- LIMIT and OFFSET result rows
- SELECT (output/print) output fields
(set operations like UNION
only work with query results - not sure where to list them - either they're first or last depending on how you look at it)
Depending on whether there are NULL values in your year
or not you could try COUNT(*)
or COUNT(title)
or even COUNT(regexp_extract(title, r'\((\d{4})\)'))
If you don't want the extra regex but you need the field year
, you can prepare the table in a CTE WITH
and do the grouping in a query referencing the CTE.