I am trying to summarize some massive tables in a way that can help me investigate further for some data issues. There are hundreds of 1000s of rows, and roughly 80 columns of data in most of these tables.
From each table, I have already performed queries to throw out any columns that have all nulls or 1 value only. I've done this for 2 reasons.... for my purposes, a single value or nulls in the columns are not interesting to me and provide little information about the data; additionally, the next step is that I want to query each of the remaining columns and return up to 30 distinct values in each column (if the column has more than 30 distinct values, we show the 1st 30 distinct values).
Here is the general format of output I wish to create:
Column_name(total_num_distinct): distinct_val1(val1_count), distinct_val2(val2_couunt), ... distinct_val30(val30_count)
Assuming my data fields are integers, floats, and varchar2 data types, this is the SQL I was trying to use to generate that output:
declare
begin
for rw in (
select column_name colnm, num_distinct numd
from all_tab_columns
where
owner = 'scz'
and table_name like 'tbl1'
and num_distinct > 1
order by num_distinct desc
) loop
dbms_output.put(rw.colnm || '(' || numd || '): ');
for rw2 in (
select dis_val, cnt from (
select rw.colnm dis_val, count(*) cnt
from tbl1
group by rw.colnm
order by 2 desc
) where rownum <= 30
) loop
dbms_output.put(rw2.dis_val || '(' || rw2.cnt || '), ');
end loop;
dbms_output.put_line(' ');
end loop;
end;
I get the output I expect from the 1st loop, but the 2nd loop that is supposed to output examples of the unique values in each column, coupled with the frequency of their occurrence for the 30 values with the highest frequency of occurrence, appears to not be working as I intended. Instead of seeing unique values along with the number of times that value occurs in the field, I get the column names and count of total records in that table.
If the 1st loop suggests the first 4 columns in 'tbl1' with more than 1 distinct value are the following:
| colnm | numd |
|----------------|
| Col1 | 2 |
| Col3 | 4 |
| Col7 | 17 |
| Col12 | 30 |
... then the full output of 1st and 2nd loop together looks something like the following from my SQL:
Col1(2): Col1(tbl1_tot_rec_count), Col3(tbl1_tot_rec_count)
Col3(4): Col1(tbl1_tot_rec_count), Col3(tbl1_tot_rec_count), Col7(tbl1_tot_rec_count), Col12(tbl1_tot_rec_count)
Col7(17): Col1(tbl1_tot_rec_count), Col3(tbl1_tot_rec_count), Col7(tbl1_tot_rec_count), Col12(tbl1_tot_rec_count), .... , ColX(tbl1_tot_rec_count)
Col12(30): Col1(tbl1_tot_rec_count), Col3(tbl1_tot_rec_count), Col7(tbl1_tot_rec_count), Col12(tbl1_tot_rec_count), .... , ColX(tbl1_tot_rec_count)
The output looks cleaner when real data is output, each table outputting somewhere between 20-50 lines of output (i.e. columns with more than 2 values), and listing 30 unique values for each field (with their counts) only requires a little bit of scrolling, but isn't impractical. Just to give you an idea with fake values, the output would look more like this with real data if it was working correctly (but fake in my example):
Col1(2): DisVal1(874,283), DisVal2(34,578),
Col3(4): DisVal1(534,223), DisVal2(74,283), DisVal3(13,923), null(2348)
Col7(17): DisVal1(54,223), DisVal2(14,633), DisVal3(13,083), DisVal4(12,534), DisVal5(9,876), DisVal6(8,765), DisVal7(7654), DisVal8(6543), DisVal9(5432), ...., ...., ...., ...., ...., ...., ...., DisVal17(431)
I am not an Oracle or SQL guru, so I might not be approaching this problem in the easiest, most efficient way. While I do appreciate any better ways to approach this problem, I also want to learn why the SQL code above is not giving me the output I expect. My goal is trying to quickly run a single query that can tell me which columns have interesting data I might what to examine further in that table. I have probably 20 tables I need to examine that are all of similar dimensions and so very difficult to examine comprehensively. Being able to reduce these tables in this way to know what possible combinations of values may exist across the various fields in each of these tables would be very helpful in further queries to deep dive into the intricacies of the data.
CodePudding user response:
It's because the select rw.colnm dis_val, count(*) cnt from tbl1 group by rw.colnm order by 2 desc
is not doing at all what you think, and what you think to be done can't be done without dynamic SQL. What it does is in fact select 'a_column_of_tabl1' dis_val, count(*) cnt from tbl1 group by 'a_column_of_tabl1' order by 2 desc
and what you need to do is execute dynamically the SQL: 'select ' || rw.colnm || ' dis_val, count(*) cnt from tbl1 group by ' || rw.colnm || ' order by 2 desc'
.
CodePudding user response:
Here is the (beta) query you can use to get the SQL to execute: (I tested with user_* VIEWs instead of ALL_* to avoid getting to many results here...)
select utc.table_name, utc.column_name,
REPLACE( REPLACE(q'~select col_name, col_value, col_counter
from (
SELECT col_name, col_value, col_counter,
ROW_NUMBER() OVER(ORDER BY col_counter DESC) AS rn
FROM (
select '{col_name}' AS col_name,
{col_name} AS col_value, COUNT(*) as col_counter
from {table_name}
group by {col_name}
)
)
WHERE rn <= 30~', '{col_name}', utc.column_name), '{table_name}', utc.table_name) AS sql
from user_tab_columns utc
where not exists(
select 1
from user_indexes ui
join user_ind_columns uic on uic.table_name = ui.table_name
and uic.index_name = ui.index_name
where
ui.table_name = utc.table_name
and exists (
select 1 from user_ind_columns t where t.table_name = ui.table_name
and t.index_name = uic.index_name and t.column_name = utc.column_name
)
group by ui.table_name, ui.index_name having( count(*) = 1 )
)
and not exists(
SELECT 1
FROM user_constraints uc
JOIN user_cons_columns ucc ON ucc.constraint_name = uc.constraint_name
WHERE constraint_type IN (
'P', -- primary key
'U' -- unique
)
AND uc.table_name = utc.table_name AND ucc.column_name = utc.column_name
)
;