I have a database that was given to us without constraints, no PK or FK.
I'm trying to reverse engineer to find out connections between tables. In order to do so my approach is to create a list of all columns with the same name and data type. So I created this query:
SELECT schema_name(tab.schema_id) AS schema_name
,tab.name AS table_name
,col.name AS column_name
,t.name AS data_type
,SUM([Partitions].[rows]) AS [TotalRowCount]
FROM sys.tables AS tab
INNER JOIN sys.columns AS col ON tab.object_id = col.object_id
LEFT JOIN sys.types AS t ON col.user_type_id = t.user_type_id
JOIN sys.partitions AS [Partitions] ON tab.[object_id] = [Partitions].[object_id]
AND [Partitions].index_id IN (
0
,1
)
GROUP BY schema_name(tab.schema_id)
,tab.name
,col.name
,t.name
ORDER BY col.name
As result I have this:
schema_name | table_name | column_name | data_type | TotalRowCount |
---|---|---|---|---|
H513WEB | ALPRETAR | IDREVI | numeric | 1439 |
H513WEB | ALPRETAR | IDREVN | numeric | 1439 |
H513WEB | CPDOENP | IDRLPP | numeric | 4156 |
H513WEB | ALPRETAR | IDRNG | numeric | 1439 |
H513WEB | FAAFACP | IDROLDEBCOM | numeric | 100541 |
H513WEB | FAAFACP_OLD | IDROLDEBCOM | numeric | 1513 |
H513WEB | CACPTAP | IDROLE | numeric | 22109 |
H513WEB | CARROLP | IDROLE | numeric | 5 |
H513WEB | FAENTFP | IDROLE | numeric | 26084 |
H513WEB | CACPTCP | IDRUBRIQUE | numeric | 48 |
H513WEB | CACPTRP | IDRUBRIQUE | numeric | 12942 |
H513WEB | CAIMDCP | IDRUBRIQUE | numeric | 6637 |
H513WEB | CAIMGIP | IDRUBRIQUE | numeric | 5 |
H513WEB | CAIMRSP | IDRUBRIQUE | numeric | 40 |
H513WEB | CAOREXP | IDRUBRIQUE | numeric | 5 |
H513WEB | CAPLARP | IDRUBRIQUE | numeric | 3 |
H513WEB | CARERUP | IDRUBRIQUE | numeric | 175 |
H513WEB | CARUBRP | IDRUBRIQUE | numeric | 312 |
but I only would like to list rows that have the same column_name
and data_type
and the result should look like this:
schema_name | table_name | column_name | data_type | TotalRowCount |
---|---|---|---|---|
H513WEB | FAAFACP | IDROLDEBCOM | numeric | 100541 |
H513WEB | FAAFACP_OLD | IDROLDEBCOM | numeric | 1513 |
H513WEB | CACPTAP | IDROLE | numeric | 22109 |
H513WEB | CARROLP | IDROLE | numeric | 5 |
H513WEB | FAENTFP | IDROLE | numeric | 26084 |
H513WEB | CACPTCP | IDRUBRIQUE | numeric | 48 |
H513WEB | CACPTRP | IDRUBRIQUE | numeric | 12942 |
H513WEB | CAIMDCP | IDRUBRIQUE | numeric | 6637 |
H513WEB | CAIMGIP | IDRUBRIQUE | numeric | 5 |
H513WEB | CAIMRSP | IDRUBRIQUE | numeric | 40 |
H513WEB | CAOREXP | IDRUBRIQUE | numeric | 5 |
H513WEB | CAPLARP | IDRUBRIQUE | numeric | 3 |
H513WEB | CARERUP | IDRUBRIQUE | numeric | 175 |
H513WEB | CARUBRP | IDRUBRIQUE | numeric | 312 |
How to do that?
CodePudding user response:
If you're trying to identify PK and FK candidates, it's not just the data type that needs to match, but the length, scale and precision too. A varchar(10)
can't be a foreign key of a varchar(8)
for example. In actuality the name doesn't need to match at all, but you would end up with a terrible number of false positives without including that.
I've guessed a little at what you want but i think this might be along the right lines. The (estimated) row count, on the other hand, doesn't seem necessary at all, so I've removed that from the query:
WITH ColumnCount AS(
SELECT s.name AS SchemaName,
t.name AS TableName,
c.name AS ColumnName,
ct.[name] AS DataType,
c.max_length,
c.precision,
c.scale,
COUNT(c.column_id) OVER (PARTITION BY c.[name], ct.[name], c.max_length, c.precision, c.scale) AS Duplicates
FROM sys.schemas s
JOIN sys.tables t ON s.schema_id = t.schema_id
JOIN sys.columns c ON t.object_id = c.object_id
JOIN sys.types ct ON c.user_type_id = ct.user_type_id)
SELECT *
FROM ColumnCount CC
WHERE CC.Duplicates > 1
ORDER BY CC.ColumnName,
CC.SchemaName,
CC.TableName;