I need to find all columns that have 5 or more distinct values. Now my query is like:
SELECT TABLE_NAME,COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'MY_SCHEMA'
AND TABLE_NAME IN ('TABLE_1', 'TABLE_2', 'TABLE_3')
I thought it could be done like simple subquery. Something like:
*code above*
AND (select count(distinct COLUMN_NAME) FROM TABLE_SCHEMA TABLE_NAME) > 5
I just recently started to learn SQL and thought this kind of thing is easy, but still I can't figure out right query.
CodePudding user response:
With help of Stu's answer and this answer I was able to make workable solution.
declare @RowsToProcess int
declare @CurrentRow int
declare @SelectCol nvarchar(max)
declare @SelectTable nvarchar(max)
declare @tablesAndColumns table(RowID int not null primary key identity(1,1), table_name nvarchar(max), column_name nvarchar(max)
insert into @tablesAndColumns
select TABLE_NAME,COLUMN_NAME,DATA_TYPE
from INFORMATION_SCHEMA.COLUMNS
where TABLE_SCHEMA = 'my schema'
and TABLE_NAME in ('myTable', 'myTable2' ,'myTable3')
set @RowsToProcess=@@ROWCOUNT
set @CurrentRow=0
while @CurrentRow<@RowsToProcess
begin
set @CurrentRow=@CurrentRow 1
select
@SelectCol=column_name,
@SelectTable=table_name
from @tablesAndColumns
where RowID=@CurrentRow
declare @QRY NVARCHAR(MAX)
set @QRY = ' insert into [my_schema].[result_table] (table_name,column_name,distinct_values)
SELECT ' '''' @SelectTable '''' ', ' '''' @SelectCol '''' ', count(*) as cnt
FROM (SELECT DISTINCT ' @SelectCol ' FROM my_schema.' @SelectTable ') as a'
exec SP_EXECUTESQL @QRY
end
CodePudding user response:
I'd like to propose another way. You can run through all the column and table names by using a CURSOR. That way you don't need to store them beforehand and can directly access them in your loop while also having a while condition.
Also I went with sys.tables and sys.columns since I noticed that INFORMATION_SCHEMA also contains views and sys.tables can be filtered for the table's type.
I added a "HAVING COUNT(*) >= 5" into the dynamic SQL so I don't save those informations in the first place rather than filtering them later.
Finally I went with "(NOLOCK)" because you only try to acces the tables for reading and that way you don't lock them for other users / interactions.
(The @i and @max are just for tracking the progress since I ran the query on ~10k columns and just wanted to see how far it is.)
Hopefully might be helpful aswell although you seem to have solved your problem.
DECLARE @columnName nvarchar(100),
@tableName nvarchar(100),
@sql nvarchar(MAX),
@i int = 0,
@max int = (SELECT COUNT(*)
FROM sys.tables T
INNER JOIN sys.columns C ON T.object_id = C.object_id
WHERE T.[type] = 'U')
DROP TABLE IF EXISTS #resultTable
CREATE TABLE #resultTable (ColumnName nvarchar(100), TableName nvarchar(100), ResultCount int)
DECLARE db_cursor CURSOR FOR
SELECT C.[name], T.[name]
FROM sys.tables T
INNER JOIN sys.columns C ON T.object_id = C.object_id
WHERE T.[type] = 'U'
OPEN db_cursor
FETCH NEXT FROM db_cursor INTO @columnName, @tableName
WHILE @@FETCH_STATUS = 0
BEGIN
SET @sql = CONCAT(' INSERT INTO #resultTable (ColumnName, TableName, ResultCount)
SELECT ''', @columnName, ''', ''', @tableName, ''', COUNT(*)
FROM (
SELECT DISTINCT [', @columnName, ']
FROM [', @tableName, '] (NOLOCK)
WHERE [', @columnName, '] IS NOT NULL
) t
HAVING COUNT(*) >= 5')
EXEC sp_executesql @sql
SET @i = @i 1
PRINT CONCAT(@i, ' / ', @max)
FETCH NEXT FROM db_cursor INTO @columnName, @tableName
END
CLOSE db_cursor
DEALLOCATE db_cursor
SELECT *
FROM #resultTable