Home > Back-end >  BigQuery - Given a table with 1623 rows (and all are distinct), how to query to get results with a n
BigQuery - Given a table with 1623 rows (and all are distinct), how to query to get results with a n

Time:05-22

New to BigQuery and SQL (and Stack Overflow), spent hours and couldn't find a solution on the web and couldn't figure it out myself. Would really appreciate if someone could shed some light:

Data Source from BigQuery: bigquery-public-data.new_york.citibike_stations

Screenshot of the table named "citibike_stations" showing 1623 rows.

The next screenshot shows that the table has a column named "num_bikes_available", which I used in my query.

Screenshot showing that the "citibike_stations" table has a column named "num_bikes_available"

I queried the following:

SELECT
  station_id,
  num_bikes_available,
  AVG(num_bikes_available) AS avg_num_bikes_available
FROM
  bigquery-public-data.new_york.citibike_stations;

Error Message: SELECT list expression references column station_id which is neither grouped nor aggregated at [2:3]

So I added a "GROUP BY" clause at the end:

SELECT
  station_id,
  num_bikes_available,
  AVG(num_bikes_available) AS avg_num_bikes_available
FROM
  bigquery-public-data.new_york.citibike_stations
GROUP BY
  station_id, num_bikes_available;

The result I got is not what I wanted, which is shown in the following screenshot:

Screenshot of the query result, but not the desired result.

Someone else did the following query and was able to get it right (using a subquery):

SELECT
  station_id,
  num_bikes_available,
  (SELECT AVG(num_bikes_available)
   FROM bigquery-public-data.new_york.citibike_stations) AS avg_num_bikes_available
FROM
  bigquery-public-data.new_york.citibike_stations;

Screenshot of the correct result ↓:

Screenshot of the correct result

Questions:

  1. Why wouldn't it work when the "AVG(num_bikes_available) AS avg_num_bikes_available" is in the "SELECT" statement, as shown in the first set of query?

  2. Why did it work when the "(SELECT AVG(num_bikes_available) FROM bigquery-public-data.new_york.citibike_stations) AS avg_num_bikes_available" is nested in the SELECT statement, as shown in the last set of query? Why does it not ask for aggregation when the "(SELECT AVG(num_bikes_available) FROM bigquery-public-data.new_york.citibike_stations) AS avg_num_bikes_available" is nested within the SELECT statement?

CodePudding user response:

SELECT
  station_id,
  num_bikes_available,
  AVG(num_bikes_available) OVER() AS avg_num_bikes_available
FROM
  bigquery-public-data.new_york.citibike_stations;

CodePudding user response:

You are confusing how aggregation works. I think it would be easier if you tried to express the query in a sentence.

What you want is 'A list of all the available stations along with information about the number of bikes in each station and the average number of bikes available in all stations'.

Note that the average is just one value, the total number of available bikes in all stations divided by the number of stations. This is the reason why the independant query actually works for what you want.

The 'group by' indicates which rows of the initial table will create each group/row of the result. In your case you want the average of the whole table, so when the query is expressed properly there are no groups to be defined.

If your table had multiple entries per city for example, you could do the average grouped by city to find the average number of available bikes in the stations of each city. Note that the select would be able to return the city and the average, but not any other attribute, e.g. the station id as they wouldn't make sense within each group.

  • Related