I have a database with up to 400 tables mixed from different sources. I need to group those tables in an excel file by column similarity (considering that tables have 0, 1, 2, or all columns with the same name). The challenge is as the example follows:
fac.table_1 have columns C1, C2, C3, C4 and C5
dim.table_2 has columns C1, C3, and C5
stg.table_3 has columns C1, C6, and C7
stg.table_4 has columns C2, and C99
...
The expected result should be:
sch_name | table_name | ncols | nmatches
dim | table_2 | 3 | 3
stg | table_3 | 3 | 1
stg | table_4 | 2 | 1
I think the way is to use something like this code together with COUNT or INTERSECT, inserting in WHERE the table name that I want to compare with others:
SELECT
schemas.name sch_nm,
tables.name tb_nm,
columns.name col_nm
FROM sys.tables
LEFT JOIN sys.columns ON tables.object_id = columns.object_id
LEFT JOIN sys.schemas ON tables.schema_id = schemas.schema_id
CodePudding user response:
You wanted to count the number of columns which the column name exists in another table ?
select sch_name, tbl_name,
ncols = count(*),
nmatches = sum(case when col_cnt > 1 then 1 else 0 end),
percentage = sum(case when col_cnt > 1 then 1 else 0 end) * 100 / count(*)
from
(
select sch_name = s.name,
tbl_name = t.name,
col_name = c.name,
col_cnt = count(c.name) over(partition by c.name)
from sys.schemas s
inner join sys.tables t on s.schema_id = t.schema_id
inner join sys.columns c on t.object_id = c.object_id
where t.name in ('table1', 'table2', 'table3', 'table4')
) c
where tbl_name not in ('table1')
group by sch_name, tbl_name
order by c.tbl_name;
Result :
sch_name | tbl_name | ncols | nmatches |
---|---|---|---|
fac | table_1 | 5 | 4 |
dim | table_2 | 3 | 3 |
stg | table_3 | 3 | 1 |
stg | table_4 | 2 | 1 |