Home > OS >  Need a SQL query explained
Need a SQL query explained

Time:08-25

I'm learning the databricks platform at the moment, and I'm on a lesson where we are talking about CTE's. This specific query is of a CTE in a CTE definition, and the girl in the video is not doing the best job breaking down what exactly this query is doing.

WITH lax_bos AS (
  WITH origin_destination (origin_airport, destination_airport) AS (
    SELECT
      origin,
      destination
    FROM
      external_table
  )
  SELECT
    *
  FROM
    origin_destination
  WHERE
    origin_airport = 'LAX'
    AND destination_airport = 'BOS'
)
SELECT
  count(origin_airport) AS `Total Flights from LAX to BOS`
FROM
  lax_bos;

the output of the query comes out to 684 which I know comes from the last select statement, It's just mostly everything that's going on above, I don't fully understand what's happening.

CodePudding user response:

at first you choose 2 needed columns from external_table and name this cte "origin_destination" :

   SELECT
      origin,
      destination
    FROM
      external_table

next you filter it in another cte named "lax_bos"

SELECT
    *
  FROM
    origin_destination ------the cte you already made
  WHERE
    origin_airport = 'LAX'
    AND destination_airport = 'BOS'

and this is the main query where you use cte "lax_bos" that you made in previous step, here you just count a number of flights:

SELECT
  count(origin_airport) AS `Total Flights from LAX to BOS`
FROM
  lax_bos 

CodePudding user response:

Nesting CTE's is wierd. Normally they form a single-level transformation pipeline, like this:

WITH origin_destination (origin_airport, destination_airport) AS 
(
  SELECT origin, destination
  FROM external_table
), lax_bos AS 
(
  SELECT *
  FROM origin_destination
  WHERE origin_airport = 'LAX'
    AND destination_airport = 'BOS'
)
SELECT count(origin_airport) AS `Total Flights from LAX to BOS`
FROM lax_bos;

CodePudding user response:

I do not understand why you are using an common table expression (cte).

I am going to give you a quick overview of how this can be done without an cte.

Always, use some type of sample data set. There are plenty that are installed with databricks. In fact, there is one for delayed airplane departures.

enter image description here

The next step is to read in the file and convert it to a temporary view.

enter image description here

At this point, we can use the Spark SQL magic command to query the data.

enter image description here

The query shows plane flights from LAX to BOS. We can remove the limit 10 option and change the '*' to "count(*) as Total" to get your answer. Thus, we solved the problem without a CTE.

enter image description here

The above image uses a CTE to pull the origin, destination and delay for all flights from LAX to BOS. Then it bins the delays from -9 to 9 hours with counts.

Again, this can all be done in one SQL statement that might be cleaner.

I reserve CTE for more complex situations. For instance, calculating a complex math formula using a range of data and paring it with the base data set.

CodePudding user response:

CTE can be recursive query, or subquery. Here, they are only simple subquery.

1st, the query origin_destination is done. Second, the query lax_bos is done over origin_destination result. And then, the final query is done on lax_bos result.

  • Related