Home > OS >  How to get metadata from columns that have specific number of distinct values?
How to get metadata from columns that have specific number of distinct values?

Time:11-23

I need to find all columns that have 5 or more distinct values. Now my query is like:

 SELECT TABLE_NAME,COLUMN_NAME, DATA_TYPE
 FROM INFORMATION_SCHEMA.COLUMNS
 WHERE TABLE_SCHEMA = 'MY_SCHEMA' 
 AND   TABLE_NAME IN ('TABLE_1', 'TABLE_2', 'TABLE_3')

I thought it could be done like simple subquery. Something like:

*code above*
AND (select count(distinct COLUMN_NAME) FROM TABLE_SCHEMA   TABLE_NAME) > 5

I just recently started to learn SQL and thought this kind of thing is easy, but still I can't figure out right query.

CodePudding user response:

With help of Stu's answer and this answer I was able to make workable solution.

declare @RowsToProcess  int
declare @CurrentRow     int
declare @SelectCol     nvarchar(max)
declare @SelectTable     nvarchar(max)

declare @tablesAndColumns table(RowID int not null primary key identity(1,1), table_name nvarchar(max), column_name nvarchar(max)

insert into @tablesAndColumns  
select TABLE_NAME,COLUMN_NAME,DATA_TYPE
from INFORMATION_SCHEMA.COLUMNS
where TABLE_SCHEMA = 'my schema' 
and   TABLE_NAME in ('myTable', 'myTable2' ,'myTable3')

set @RowsToProcess=@@ROWCOUNT

set @CurrentRow=0
while @CurrentRow<@RowsToProcess
begin
    set @CurrentRow=@CurrentRow 1
    select 
        @SelectCol=column_name,
        @SelectTable=table_name
        from @tablesAndColumns
        where RowID=@CurrentRow

    declare @QRY NVARCHAR(MAX)
    set @QRY = ' insert into [my_schema].[result_table] (table_name,column_name,distinct_values) 
                 SELECT '   ''''  @SelectTable  ''''   ', '    ''''  @SelectCol  ''''       ', count(*) as cnt  
                 FROM (SELECT DISTINCT '  @SelectCol  ' FROM my_schema.'  @SelectTable ') as a'
    exec SP_EXECUTESQL @QRY

end

CodePudding user response:

I'd like to propose another way. You can run through all the column and table names by using a CURSOR. That way you don't need to store them beforehand and can directly access them in your loop while also having a while condition.

Also I went with sys.tables and sys.columns since I noticed that INFORMATION_SCHEMA also contains views and sys.tables can be filtered for the table's type.

I added a "HAVING COUNT(*) >= 5" into the dynamic SQL so I don't save those informations in the first place rather than filtering them later.

Finally I went with "(NOLOCK)" because you only try to acces the tables for reading and that way you don't lock them for other users / interactions.

(The @i and @max are just for tracking the progress since I ran the query on ~10k columns and just wanted to see how far it is.)

Hopefully might be helpful aswell although you seem to have solved your problem.

DECLARE @columnName nvarchar(100),
    @tableName nvarchar(100),
    @sql nvarchar(MAX),
    @i int = 0,
    @max int = (SELECT COUNT(*)
                FROM sys.tables T
                INNER JOIN sys.columns C ON T.object_id = C.object_id
                WHERE T.[type] = 'U')

DROP TABLE IF EXISTS #resultTable

CREATE TABLE #resultTable (ColumnName nvarchar(100), TableName nvarchar(100), ResultCount int)

DECLARE db_cursor CURSOR FOR
SELECT C.[name], T.[name]
FROM sys.tables T
INNER JOIN sys.columns C ON T.object_id = C.object_id
WHERE T.[type] = 'U'

OPEN db_cursor  
FETCH NEXT FROM db_cursor INTO @columnName, @tableName

WHILE @@FETCH_STATUS = 0  
BEGIN  
    SET @sql = CONCAT(' INSERT INTO #resultTable (ColumnName, TableName, ResultCount)
                        SELECT ''', @columnName, ''', ''', @tableName, ''', COUNT(*)
                        FROM (
                            SELECT DISTINCT [', @columnName, ']
                            FROM [', @tableName, '] (NOLOCK)
                            WHERE [', @columnName, '] IS NOT NULL
                        ) t
                        HAVING COUNT(*) >= 5')

    EXEC sp_executesql @sql

    SET @i = @i   1
    PRINT CONCAT(@i, ' / ', @max)

    FETCH NEXT FROM db_cursor INTO @columnName, @tableName
END 

CLOSE db_cursor  
DEALLOCATE db_cursor 

SELECT *
FROM #resultTable
  • Related