Retrieve tables based on their contents in SQL Server

I'd like to retrieve all tables, along with the associated column values, where two specific columns (the column names will be passed in) don't have exactly the same content.

Here's a more definite breakdown of the problem. Suppose the columns I need to look into are 'Column_1' and 'Column_2'.

First, identify from INFORMATION_SCHEMA which tables have both of these columns present (possibly as one sub-query).
Then identify which of these tables don't have exactly the same content in these 2 columns, meaning Column_1 != Column_2.

The following query retrieves all the tables that have both 'Column_1' and 'Column_2'.

SELECT
    TABLE_NAME
FROM
    INFORMATION_SCHEMA.TABLES T
WHERE
    T.TABLE_CATALOG = 'myDB' AND
    T.TABLE_TYPE = 'BASE TABLE'
    AND EXISTS (
        SELECT T.TABLE_NAME
        FROM INFORMATION_SCHEMA.COLUMNS C
        WHERE
            C.TABLE_CATALOG = T.TABLE_CATALOG AND
            C.TABLE_SCHEMA = T.TABLE_SCHEMA AND
            C.TABLE_NAME = T.TABLE_NAME AND
            C.COLUMN_NAME = 'Column_1')
    AND EXISTS
    (
    SELECT T.TABLE_NAME
        FROM INFORMATION_SCHEMA.COLUMNS C
        WHERE
            C.TABLE_CATALOG = T.TABLE_CATALOG AND
            C.TABLE_SCHEMA = T.TABLE_SCHEMA AND
            C.TABLE_NAME = T.TABLE_NAME AND
            C.COLUMN_NAME = 'Column_2')

As the next step, I tried to use this as a sub-query with the following at the end, but that doesn't work: SQL Server returns 'Cannot call methods on sysname'. What would the next step be? Assume that all the columns have exactly the same data type.

WHERE SUBQUERY.TABLE_NAME.Column_1 != SUBQUERY.TABLE_NAME.Column_2

This is what's expected:

Table_Name	Column_Name1	Column_Value_1	Column_Name2	Column_Value_2
Table_A	Column_1	abcd	Column_2	abcde
Table_A	Column_1	qwerty	Column_2	qwert
Table_A	Column_1	abcde	Column_2	eabcde
Table_B	Column_1	zxcv	Column_2	zxcde
Table_C	Column_1	asdfgh	Column_2	asdfghy
Table_C	Column_1	aaaa	Column_2	bbbb

CodePudding user response：

I believe you need to compare the CHARACTER_MAXIMUM_LENGTH or CHARACTER_OCTET_LENGTH metadata values in the INFORMATION_SCHEMA.COLUMNS view instead of using LEN(). This can be done with something like:

SELECT T.TABLE_NAME
    , C1.COLUMN_NAME, C1.DATA_TYPE, C1.CHARACTER_MAXIMUM_LENGTH
    , C2.COLUMN_NAME, C2.DATA_TYPE, C2.CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.TABLES T
JOIN INFORMATION_SCHEMA.COLUMNS C1
    ON C1.TABLE_CATALOG = T.TABLE_CATALOG
    AND C1.TABLE_SCHEMA = T.TABLE_SCHEMA
    AND C1.TABLE_NAME = T.TABLE_NAME
    AND C1.COLUMN_NAME = 'Column_1'
JOIN INFORMATION_SCHEMA.COLUMNS C2
    ON C2.TABLE_CATALOG = T.TABLE_CATALOG
    AND C2.TABLE_SCHEMA = T.TABLE_SCHEMA
    AND C2.TABLE_NAME = T.TABLE_NAME
    AND C2.COLUMN_NAME = 'Column_2'
WHERE T.TABLE_CATALOG = 'myDB'
AND T.TABLE_TYPE = 'BASE TABLE'
AND C1.CHARACTER_MAXIMUM_LENGTH <> C2.CHARACTER_MAXIMUM_LENGTH

The two inner joins limit the results to tables that have both columns and also retrieve the column metadata. The length comparison at the end checks for a mismatch.

This assumes character types. You might also want to check DATA_TYPE consistency ("char" vs "varchar" vs "nvarchar") or some of the other precision and scale values for other non-character data types.
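
As a rough sketch of that broader check, the query below also compares DATA_TYPE, precision and scale. It reuses the 'myDB', 'Column_1' and 'Column_2' names from the question; the ISNULL(..., 0) sentinel is my own assumption for treating two missing values as equal.

```sql
-- Sketch: tables where Column_1 and Column_2 differ in declared type,
-- character length, numeric precision, or numeric scale.
SELECT T.TABLE_SCHEMA, T.TABLE_NAME
    , C1.DATA_TYPE AS Column_1_Type, C1.CHARACTER_MAXIMUM_LENGTH AS Column_1_Length
    , C2.DATA_TYPE AS Column_2_Type, C2.CHARACTER_MAXIMUM_LENGTH AS Column_2_Length
FROM INFORMATION_SCHEMA.TABLES T
JOIN INFORMATION_SCHEMA.COLUMNS C1
    ON C1.TABLE_CATALOG = T.TABLE_CATALOG
    AND C1.TABLE_SCHEMA = T.TABLE_SCHEMA
    AND C1.TABLE_NAME = T.TABLE_NAME
    AND C1.COLUMN_NAME = 'Column_1'
JOIN INFORMATION_SCHEMA.COLUMNS C2
    ON C2.TABLE_CATALOG = T.TABLE_CATALOG
    AND C2.TABLE_SCHEMA = T.TABLE_SCHEMA
    AND C2.TABLE_NAME = T.TABLE_NAME
    AND C2.COLUMN_NAME = 'Column_2'
WHERE T.TABLE_CATALOG = 'myDB'
AND T.TABLE_TYPE = 'BASE TABLE'
AND (C1.DATA_TYPE <> C2.DATA_TYPE
    -- ISNULL(..., 0) is an assumed sentinel so that NULL metadata compares as equal
    OR ISNULL(C1.CHARACTER_MAXIMUM_LENGTH, 0) <> ISNULL(C2.CHARACTER_MAXIMUM_LENGTH, 0)
    OR ISNULL(C1.NUMERIC_PRECISION, 0) <> ISNULL(C2.NUMERIC_PRECISION, 0)
    OR ISNULL(C1.NUMERIC_SCALE, 0) <> ISNULL(C2.NUMERIC_SCALE, 0));
```

Like the query above, this only compares column definitions, not the data stored in the columns.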

CodePudding user response：

To query the data within the columns you need dynamic SQL. I would advise you not to use INFORMATION_SCHEMA (which is for compatibility only) and instead use sys.tables etc. You don't need to check sys.columns twice, you can use aggregation in the EXISTS subquery to check for multiple columns.

To compare the columns, you can do Column_1 <> Column_2, but that will not deal with nulls correctly. If the columns can be nullable then you should instead use the syntax shown in the code below: NOT EXISTS (SELECT Column_1 INTERSECT SELECT Column_2)

DECLARE @sql nvarchar(max);

SELECT @sql =
  STRING_AGG(CAST('
SELECT 
  Table_Name = ' + QUOTENAME(t.name, '''') + ',
  Column_1,
  Column_2
FROM ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name) + '
WHERE NOT EXISTS (SELECT Column_1 INTERSECT SELECT Column_2)
'  AS nvarchar(max)), '
UNION ALL
' )
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
  AND s.name = 'myDB'   -- note: this filters on the schema name, not the database name
WHERE EXISTS (SELECT 1
    FROM sys.columns c
    WHERE c.object_id = t.object_id
      AND c.name IN ('Column_1', 'Column_2')
    HAVING COUNT(*) = 2
       AND COUNT(DISTINCT c.system_type_id) = 1  -- all same type
);

PRINT @sql;     -- your friend

EXEC sp_executesql @sql;
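
To see why the INTERSECT form matters for nullable columns, here is a small standalone comparison. The table variable @t and its sample rows are made up purely for illustration.

```sql
-- Hypothetical sample data; @t is not part of the original problem.
DECLARE @t TABLE (Column_1 varchar(10) NULL, Column_2 varchar(10) NULL);

INSERT INTO @t (Column_1, Column_2)
VALUES ('a', 'a'), ('a', 'b'), ('a', NULL), (NULL, NULL);

-- Plain inequality: any comparison with NULL evaluates to UNKNOWN,
-- so the ('a', NULL) row is silently filtered out.
SELECT Column_1, Column_2
FROM @t
WHERE Column_1 <> Column_2;

-- INTERSECT treats two NULLs as equal and NULL vs. a value as different,
-- so the ('a', NULL) row is reported and the (NULL, NULL) row is not.
SELECT Column_1, Column_2
FROM @t
WHERE NOT EXISTS (SELECT Column_1 INTERSECT SELECT Column_2);
```

The first query should return only the ('a', 'b') row, while the second should return both ('a', 'b') and ('a', NULL).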

CodePudding user response：

If in fact you want to actually compare values (not length) between two columns in tables that contain those two columns, you will need to generate dynamic SQL and then execute it. This could be done semi-automatically with the following:

DECLARE @SqlTemplate VARCHAR(MAX) =
    'UNION ALL'
    + ' SELECT Table_Name = <TNAME>'
    + ', Column_Name1 = <C1NAME>, Column_Value_1 = <C1>'
    + ', Column_Name2 = <C2NAME>, Column_Value_2 = <C2>'
    + ' FROM <T>'
    + ' WHERE ISNULL(<C1>, ''(null)'') <> ISNULL(<C2>, ''(null)'')'

SELECT T.TABLE_NAME
    , REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
        @SqlTemplate
        , '<TNAME>', QUOTENAME(T.TABLE_SCHEMA + '.' + T.TABLE_NAME, ''''))
        , '<C1NAME>', QUOTENAME(C1.COLUMN_NAME, ''''))
        , '<C2NAME>', QUOTENAME(C2.COLUMN_NAME, ''''))
        , '<T>', QUOTENAME(T.TABLE_SCHEMA) + '.' + QUOTENAME(T.TABLE_NAME))
        , '<C1>', QUOTENAME(C1.COLUMN_NAME))
        , '<C2>', QUOTENAME(C2.COLUMN_NAME))
FROM INFORMATION_SCHEMA.TABLES T
JOIN INFORMATION_SCHEMA.COLUMNS C1
    ON C1.TABLE_CATALOG = T.TABLE_CATALOG
    AND C1.TABLE_SCHEMA = T.TABLE_SCHEMA
    AND C1.TABLE_NAME = T.TABLE_NAME
    AND C1.COLUMN_NAME = 'Column_1'
JOIN INFORMATION_SCHEMA.COLUMNS C2
    ON C2.TABLE_CATALOG = T.TABLE_CATALOG
    AND C2.TABLE_SCHEMA = T.TABLE_SCHEMA
    AND C2.TABLE_NAME = T.TABLE_NAME
    AND C2.COLUMN_NAME = 'Column_2'
WHERE T.TABLE_CATALOG = 'myDB'
AND T.TABLE_TYPE = 'BASE TABLE'

This would generate SQL of the following form for each qualifying table:

UNION ALL SELECT Table_Name = 'dbo.Z', Column_Name1 = 'X', Column_Value_1 = [X], Column_Name2 = 'Y', Column_Value_2 = [Y] FROM [dbo].[Z] WHERE ISNULL([X], '(null)') <> ISNULL([Y], '(null)')

After running the above, you would then cut & paste the generated SQL into another query window, remove the initial 'UNION ALL', and then execute the remaining SQL to get the final results.

There are ways of combining all the SQL into a single string and executing it automatically, but your problem sounds like a one-off process that doesn't warrant the extra complexity.
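
For reference, one possible way to do that combination is sketched below. It assumes SQL Server 2017 or later (for STRING_AGG) and that the compared columns share a compatible type across tables; the @sql variable name is my own choice.

```sql
DECLARE @sql nvarchar(max);

-- Build one SELECT per qualifying table and glue them together with UNION ALL.
SELECT @sql = STRING_AGG(CAST(
      'SELECT Table_Name = ' + QUOTENAME(T.TABLE_SCHEMA + '.' + T.TABLE_NAME, '''')
    + ', Column_Name1 = ' + QUOTENAME(C1.COLUMN_NAME, '''')
    + ', Column_Value_1 = ' + QUOTENAME(C1.COLUMN_NAME)
    + ', Column_Name2 = ' + QUOTENAME(C2.COLUMN_NAME, '''')
    + ', Column_Value_2 = ' + QUOTENAME(C2.COLUMN_NAME)
    + ' FROM ' + QUOTENAME(T.TABLE_SCHEMA) + '.' + QUOTENAME(T.TABLE_NAME)
    + ' WHERE ISNULL(' + QUOTENAME(C1.COLUMN_NAME) + ', ''(null)'')'
    + ' <> ISNULL(' + QUOTENAME(C2.COLUMN_NAME) + ', ''(null)'')'
    AS nvarchar(max)), ' UNION ALL ')
FROM INFORMATION_SCHEMA.TABLES T
JOIN INFORMATION_SCHEMA.COLUMNS C1
    ON C1.TABLE_CATALOG = T.TABLE_CATALOG
    AND C1.TABLE_SCHEMA = T.TABLE_SCHEMA
    AND C1.TABLE_NAME = T.TABLE_NAME
    AND C1.COLUMN_NAME = 'Column_1'
JOIN INFORMATION_SCHEMA.COLUMNS C2
    ON C2.TABLE_CATALOG = T.TABLE_CATALOG
    AND C2.TABLE_SCHEMA = T.TABLE_SCHEMA
    AND C2.TABLE_NAME = T.TABLE_NAME
    AND C2.COLUMN_NAME = 'Column_2'
WHERE T.TABLE_CATALOG = 'myDB'
AND T.TABLE_TYPE = 'BASE TABLE';

PRINT @sql;  -- inspect the generated statement before running it

-- @sql stays NULL when no table has both columns, so guard the call.
IF @sql IS NOT NULL
    EXEC sp_executesql @sql;
```

Because STRING_AGG only places the separator between items, there is no leading 'UNION ALL' to strip by hand.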