I don't understand why in the below query there has to be reference to zones table twice (as zpu and zdo). When I try modified query with one "zones" table instead of splitting it into zpu and zdo I have different outcome. The outcome table has to have two columns, one with pickup location and second with dropoff location. Locations are contained withing "zones" table.
Can't figure out why is that and what am I lacking in SQL mechanics.
Tables:
Trips
tpep_pickup_datetime | tpep_dropoff_datetime | total_amount | PULocationID | DOLocationID |
---|---|---|---|---|
"2021-01-01 00:30:10" | "2021-01-01 00:36:12" | 11.8 | 142 | 43 |
"2021-01-01 00:51:20" | "2021-01-01 00:52:19" | 4.3 | 238 | 151 |
"2021-01-01 00:43:30" | "2021-01-01 01:11:06" | 51.95 | 132 | 165 |
"2021-01-01 00:15:48" | "2021-01-01 00:31:01" | 36.35 | 138 | 132 |
"2021-01-01 00:31:49" | "2021-01-01 00:48:21" | 24.36 | 68 | 33 |
"2021-01-01 00:16:29" | "2021-01-01 00:24:30" | 14.15 | 224 | 68 |
"2021-01-01 00:00:28" | "2021-01-01 00:17:28" | 17.3 | 95 | 157 |
"2021-01-01 00:12:29" | "2021-01-01 00:30:34" | 21.8 | 90 | 40 |
Zones
LocationID | Borough | Zone |
---|---|---|
1 | EWR | Newark Airport |
2 | Queens | Jamaica Bay |
3 | Bronx | Allerton/Pelham Gardens |
4 | Manhattan | Alphabet City Yellow Zone |
5 | Staten Island | Arden Heights |
6 | Staten Island | Arrochar/Fort Wadsworth |
7 | Queens | Astoria |
8 | Queens | Astoria Park |
9 | Queens | Auburndale |
10 | Queens | Baisley Park |
Query
select
tpep_pickup_datetime,
tpep_dropoff_datetime,
total_amount,
CONCAT(zpu."Borough", ' / ', zpu."Zone") AS "pick_up_loc",
CONCAT(zdo."Borough", ' / ', zpu."Zone") AS "drop_off_loc"
from
trips t,
zones zpu,
zones zdo
WHERE
t."PULocationID" = zpu."LocationID" AND
t."DOLocationID" = zdo."LocationID"
Outcome (good table):
tpep_pickup_datetime | tpep_dropoff_datetime | total_amount | pick_up_loc | drop_off_loc |
---|---|---|---|---|
2021-01-01 00:30:10 | 2021-01-01 00:36:12 | 11.8 | Manhattan / Lincoln Square East | Manhattan / Lincoln Square East |
2021-01-01 00:51:20 | 2021-01-01 00:52:19 | 4.3 | Manhattan / Upper West Side North | Manhattan / Upper West Side North |
2021-01-01 00:43:30 | 2021-01-01 01:11:06 | 51.95 | Queens / JFK Airport | Brooklyn / JFK Airport |
2021-01-01 00:15:48 | 2021-01-01 00:31:01 | 36.35 | Queens / LaGuardia Airport | Queens / LaGuardia Airport |
2021-01-01 00:31:49 | 2021-01-01 00:48:21 | 24.36 | Manhattan / East Chelsea | Brooklyn / East Chelsea |
2021-01-01 00:16:29 | 2021-01-01 00:24:30 | 14.15 | Manhattan / Stuy Town/Peter Cooper Village | Manhattan / Stuy Town/Peter Cooper Village |
2021-01-01 00:00:28 | 2021-01-01 00:17:28 | 17.3 | Queens / Forest Hills | Queens / Forest Hills |
2021-01-01 00:12:29 | 2021-01-01 00:30:34 | 21.8 | Manhattan / Flatiron | Brooklyn / Flatiron |
2021-01-01 00:39:16 | 2021-01-01 01:00:13 | 28.8 | Brooklyn / Fort Greene | Queens / Fort Greene |
2021-01-01 00:26:12 | 2021-01-01 00:39:46 | 18.95 | Manhattan / Yorkville West | Manhattan / Yorkville West |
CodePudding user response:
You select trips and join addresses, and you wonder why you are forced to join the addresses again and why it doesn't suffice to tell the DBMS once to get you the addresses.
This misunderstanding is probably bolstered by the ancient join syntax you are using - a syntax used in the 1980s mainly until explicit joins made it into the SQL standard in 1992.
When you join tables, you actuall join table rows.
With select * from trips, zones
you join every trip row with every zone row. With eight trip rows and 10 zones you get 80 result rows. So you joined all ten zones to each trip. With select * from trips t, zones zpu, zones zdo
you get a result of 800 rows, each trip combined with all ten zones and these again combined with all ten zones, i.e. all 100 possible zone combinations per trip.
In your query you have WHERE t.pulocationid = zpu.locationid AND t.dolocationid = zdo.locationid
. For one trip there is only one of its 100 intermediate result rows that match this criteria and the other 99 are dismissed.
Now, if you joined the zones table only once, you had ten rows per trip in your intermediate result, each row would contain one zone. Now to your criteria, there are two options: use AND
or OR
:
select * from trips t, zones z
where t."PULocationID" = z."LocationID" AND t."DOLocationID" = z."LocationID"
This would only select rows for which t.pulocationid = t.dolocationid, because the criteria demand them to both match the one zone in the joined row. In your sample data there is not a single trip where the two locations are equal. You would get an empty result set.
select * from trips t, zones z
where t."PULocationID" = z."LocationID" OR t."DOLocationID" = z."LocationID"
With this query you would get the row where the zone matches the t.pulocationid and the row where the zone matches the t.dolocationid. I.e. you would get two result rows per trip, one with the pickup zone, one with the drop off zone.
But you want one result row per trip with both zones in that same result row.
Explicit joins make this more readable:
select
tpep_pickup_datetime,
tpep_dropoff_datetime,
total_amount,
CONCAT(zpu."Borough", ' / ', zpu."Zone") AS "pick_up_loc",
CONCAT(zdo."Borough", ' / ', zpu."Zone") AS "drop_off_loc"
from trips t
inner join zones zpu on zpu."LocationID" = t."PULocationID"
inner join zones zdo on zdo."LocationID" = t."DOLocationID";
This does exactly the same thing as your query, but it is easier to read. Take a trip, join the pickup location row, join the drop off location row.