Find the top matches comparing table columns-CodePudding

I have a database with up to 400 tables mixed from different sources. I need to group those tables in an excel file by column similarity (considering that tables have 0, 1, 2, or all columns with the same name). The challenge is as the example follows:

fac.table_1 have columns C1, C2, C3, C4 and C5
dim.table_2 has columns C1, C3, and C5
stg.table_3 has columns C1, C6, and C7
stg.table_4 has columns C2, and C99

...

The expected result should be:

sch_name | table_name | ncols | nmatches
  dim    |   table_2  |   3   |   3
  stg    |   table_3  |   3   |   1
  stg    |   table_4  |   2   |   1

I think the way is to use something like this code together with COUNT or INTERSECT, inserting in WHERE the table name that I want to compare with others:

    SELECT
       schemas.name sch_nm,
       tables.name tb_nm,
       columns.name col_nm
    FROM sys.tables
       LEFT JOIN sys.columns ON tables.object_id = columns.object_id
       LEFT JOIN sys.schemas ON tables.schema_id = schemas.schema_id

CodePudding user response：

You wanted to count the number of columns which the column name exists in another table ?

select sch_name, tbl_name, 
       ncols      = count(*), 
       nmatches   = sum(case when col_cnt > 1 then 1 else 0 end),
       percentage = sum(case when col_cnt > 1 then 1 else 0 end) * 100 / count(*) 
from
(
    select sch_name = s.name, 
           tbl_name = t.name,
           col_name = c.name,
           col_cnt  = count(c.name) over(partition by c.name)
    from   sys.schemas s
           inner join sys.tables t  on s.schema_id = t.schema_id
           inner join sys.columns c on t.object_id = c.object_id
    where  t.name in ('table1', 'table2', 'table3', 'table4')
) c       
where tbl_name not in ('table1')
group by sch_name, tbl_name
order by c.tbl_name;

Result :

sch_name	tbl_name	ncols	nmatches
fac	table_1	5	4
dim	table_2	3	3
stg	table_3	3	1
stg	table_4	2	1

db<>fiddle demo