I have an SQL question in which I am struggling to understand and find relevant resources to help me.
The question is: "Write an SQL query to identify data inconsistencies between two tables."
I need to compare the following tables of data:
AssetManager
AssetManagerName |
---|
John Doe |
Joe Smith |
Dave Grey |
Lisa Sparks |
Kate Green |
Trip
PropertyCode | AssetManagerName | Date |
---|---|---|
P001 | John Doe | 2022-01-22 |
P001 | Joe Smith | 2022-01-19 |
P002 | Dave Grey | 2022-02-25 |
P002 | John Doe | 2022-04-23 |
P003 | Kate Greens | 2022-02-25 |
P004 | Joe Smith | 2022-05-29 |
P002 | Dave Grey | 2022-01-25 |
P001 | John Doe | 2022-02-24 |
Image translated to text from Original Source
What are the inconsistencies in this case? Is it maybe that "Kate Green" is in the AssetManager table, and you have "Kate Greens" in the Trip table? That's the only thing I can see.
What MySQL commands could I use that would help me to achieve this query?
CodePudding user response:
In SQL, when we talk about inconsistencies, we are generally referring to data that would not correctly translate into a normalised form, when we try to join between tables this would result in missing data or orphaned rows. Commonly inconsistencies arise when there is no referential constraints in a schema to maintain consistency. In such cases simple spelling mistakes can easily creep into the dataset, but entirely wrong values could also be used. In this case, If there is a table that represents all the possible Asset Managers, then we would expect that in other tables that refer to Asset Managers that only values from the Asset Managers table would be used, spelling mistakes and entirely missing names will be treated the same.
In the Trip
Table we can identify inconsistency with the AssetManager
table by looking for any records in Trip
that do not have a match in AssetManager
using the AssetManagerName
column.
One simple way to do this is to use an OUTER JOIN
and to exclude all the matches:
SELECT Trip.*
FROM Trip
LEFT OUTER JOIN AssetManager ON Trip.AssetManagerName = AssetManager.AssetManagerName
WHERE AssetManager.AssetManagerName IS NULL
This returns the following result: (See db-fiddle)
PropertyCode | AssetManagerName | Date |
---|---|---|
P003 | Kate Greens | 2022-02-25 |
The LEFT OUTER JOIN
(or LEFT JOIN
) will return all the rows from the Trip
table, even if there is no corresponding match in the AssetManager
table on the AssetManagerName
column. For the rows that do not match, all the values for the AssetManager
table in the result set will be NULL
.
We can then use a WHERE
clause to exclude all the matches data records and only return those records that DO NOT MATCH, we do this by only allowing where AssetManager.AssetManagerName
has a null value.
There are no records in
Trip
with a legitimate null value in theAssetManagerName
, the null only exists in the recordset at a result of theLEFT OUTER JOIN
evaluation.
You could also use a NOT EXISTS
Clause, this syntax is sometimes easier to read and identify the intent, we want to find the records that DO NOT MATCH. But specifically in MySQL it's execution plan generally less efficient than the LEFT OUTER JOIN
expression above.
SELECT Trip.*
FROM Trip
WHERE NOT EXISTS (
SELECT AssetManager.AssetManagerName
FROM AssetManager
WHERE AssetManager.AssetManagerName = Trip.AssetManagerName
)
Another variation of this is to use NOT IN
. For this query we first evaluate a list of possible values for AssetManagerName
and use that to identify the values that do not match.
This is helpful when there might be some legitimate null values in either of the tables for
AssetManagerName
asIN
handlesNULL
values differently toEXISTS
SELECT Trip.*
FROM Trip
WHERE Trip.AssetManagerName NOT IN (
SELECT AssetManager.AssetManagerName
FROM AssetManager
)
For an interesting analysis of these options and performace considerations have a read over this article: NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL