Home > Software design >  Bigquery : COUNT/GROUP BY after regex extraction
Bigquery : COUNT/GROUP BY after regex extraction

Time:03-18

I'm quite a newbie in bigquery. I have a table MOVIES with the following schema :

[SchemaField('movieId', 'INTEGER', 'NULLABLE', None, ()),
 SchemaField('title', 'STRING', 'NULLABLE', None, ()),
 SchemaField('genres', 'STRING', 'NULLABLE', None, ())]

The title of the movies looks like this : Pharaoh's Army (1995)

I want to see in which year most drama were filmed. I already managed to extract the year from the title and only keeping movies that are dramas like so :

q4 = """
select 
  movies.title,
  regexp_extract(title, r'\((\d{4})\)') as year,
from assignment-344206.movie_lens_20M.MOVIES movies
where movies.genres = "Drama"
order by year desc
limit 5
"""
query_job_4 = client.query(q4)
query_job_4.to_dataframe()

Now I need to group by year and then count each group to find the year with the most dramas. I tried this :

q4 = """
select 
  movies.title,
  regexp_extract(title, r'\((\d{4})\)') as year,
  count(year) as nb_per_year
from assignment-344206.movie_lens_20M.MOVIES movies
where movies.genres = "Drama"
group by year
order by year desc
limit 5
"""

But I get the following error BadRequest: 400 Unrecognized name: year at [5:9] which I don't understand. Without the group bythe name year was recognized in order by year descso why doesn't it recognize it now ?

CodePudding user response:

To complement Martin Weitzmann answer, if you decide to go using WITH, below its a code/script implementation of such scenario. with can be use as a temp table in a single expression for your final query. For more details about with behaviour can be found on with clause page.

query: This query shows the usage of with on your case. You can run this on the BigQuery UI.

with movies as (
    select 
        title,
        SAFE_CAST(regexp_extract(title, r'\((\d{4})\)') AS INT64) as year,
        genres
        from `projectid.datasetid.table`
)
select count(1) as movies_per_year, year
from movies
where genres = "Drama"
group by year
order by year desc

code: Replace QUERY with above query

from google.cloud import bigquery

client = bigquery.Client()

query = """ QUERY """

query_job = client.query(query)
df = query_job.to_dataframe()
print(df.head())

output

year | movier per year
2008 | 1
2003 | 1
2001 | 1
1994 | 2
1993 | 1
  ...

CodePudding user response:

The aggregator COUNT() can't know year because you're only creating it in the line above as a category/group. ORDER BY gets executed after grouping/aggregations so it can know the field year.

Rough order of execution is afaik

  1. FROM input tables
  2. JOINs on table records
  3. WHERE fields filter
  4. GROUP BY aggregate fields
  5. HAVING aggregates filter
  6. WINDOW functions
  7. QUALIFY window fields filter
  8. DISTINCT result fields
  9. ORDER BY result fields
  10. LIMIT and OFFSET result rows
  11. SELECT (output/print) output fields

(set operations like UNION only work with query results - not sure where to list them - either they're first or last depending on how you look at it)

Depending on whether there are NULL values in your year or not you could try COUNT(*) or COUNT(title) or even COUNT(regexp_extract(title, r'\((\d{4})\)'))

If you don't want the extra regex but you need the field year, you can prepare the table in a CTE WITH and do the grouping in a query referencing the CTE.

  • Related