I want to find the unique count and null percent for all columns in a MySQL database for the purpose of data discovery (an approximation of the null/distinct counts is also fine). Postgres has the pg_stats view with the null_frac and n_distinct fields; is there a table containing similar stats in MySQL? I'm not looking to run this query against every column one by one:
SELECT COUNT( DISTINCT col1) FROM table1;
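For reference, this is the kind of query I'd run on the Postgres side against the standard pg_stats view (which holds approximate per-column statistics):
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'employees';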
Here is some sample data:
CREATE TABLE employees (
employeeNumber INT NOT NULL,
lastName VARCHAR(50) NOT NULL,
firstName VARCHAR(50) NOT NULL,
extension VARCHAR(10) NOT NULL,
email VARCHAR(100) NOT NULL,
officeCode VARCHAR(10) NOT NULL,
reportsTo INT DEFAULT NULL,
jobTitle VARCHAR(50) NOT NULL,
PRIMARY KEY (employeeNumber)
);
INSERT INTO employees(employeeNumber,lastName,firstName,extension,email,officeCode,reportsTo,jobTitle) VALUES
(1002,'Murphy','Diane','x5800','[email protected]','1',NULL,'President'),
(1056,'Patterson','Mary','x4611','[email protected]','1',1002,'VP Sales'),
(1076,'Firrelli','Jeff','x9273','[email protected]','1',1002,'VP Marketing'),
(1088,'Patterson','William','x4871','[email protected]','6',1056,'Sales Manager (APAC)'),
(1102,'Bondur','Gerard','x5408','[email protected]','4',1056,'Sale Manager (EMEA)'),
(1143,'Bow','Anthony','x5428','[email protected]','1',1056,'Sales Manager (NA)'),
(1165,'Jennings','Leslie','x3291','[email protected]','1',1143,'Sales Rep'),
(1166,'Thompson','Leslie','x4065','[email protected]','1',1143,'Sales Rep'),
(1188,'Firrelli','Julie','x2173','[email protected]','2',1143,'Sales Rep'),
(1216,'Patterson','Steve','x4334','[email protected]','2',1143,'Sales Rep'),
(1286,'Tseng','Foon Yue','x2248','[email protected]','3',1143,'Sales Rep'),
(1323,'Vanauf','George','x4102','[email protected]','3',1143,'Sales Rep'),
(1337,'Bondur','Loui','x6493','[email protected]','4',1102,'Sales Rep'),
(1370,'Hernandez','Gerard','x2028','[email protected]','4',1102,'Sales Rep'),
(1401,'Castillo','Pamela','x2759','[email protected]','4',1102,'Sales Rep'),
(1501,'Bott','Larry','x2311','[email protected]','7',1102,'Sales Rep'),
(1504,'Jones','Barry','x102','[email protected]','7',1102,'Sales Rep'),
(1611,'Fixter','Andy','x101','[email protected]','6',1088,'Sales Rep'),
(1612,'Marsh','Peter','x102','[email protected]','6',1088,'Sales Rep'),
(1619,'King','Tom','x103','[email protected]','6',1088,'Sales Rep'),
(1621,'Nishi','Mami','x101','[email protected]','5',1056,'Sales Rep'),
(1625,'Kato','Yoshimi','x102','[email protected]','5',1621,'Sales Rep'),
(1702,'Gerard','Martin','x2312','[email protected]','4',1102,'Sales Rep');
This is the expected output I'm able to get from Postgres using the pg_stats view (specifically the null_frac and n_distinct columns). Sample table structure and data in Postgres: https://pastebin.com/fXg3RJdi
Thanks a lot in advance!
Answer:
This approach uses a prepared statement built from information_schema.
- Setting up the base query:
/* GROUP_CONCAT truncates at 1024 characters by default; raise it for wide tables */
SET SESSION group_concat_max_len = 1000000;
SET @sql1 := (SELECT GROUP_CONCAT(CONCAT('SELECT "',column_name,'" AS column_name,
COUNT(DISTINCT `',column_name,'`) AS num_of_distinct,
SUM(CASE WHEN `',column_name,'` IS NULL THEN 1 ELSE 0 END) AS num_null,
COUNT(*) AS TotalRows
FROM employees ') SEPARATOR '
UNION ALL
')
FROM information_schema.COLUMNS
WHERE table_name = 'employees'
AND table_schema = DATABASE()); -- MySQL has no 'public' schema; filter on the current database
/*Check @sql1 variable*/
SELECT @sql1;
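For the sample table, @sql1 expands to one SELECT per column (column order may vary), e.g.:
SELECT "employeeNumber" AS column_name,
COUNT(DISTINCT `employeeNumber`) AS num_of_distinct,
SUM(CASE WHEN `employeeNumber` IS NULL THEN 1 ELSE 0 END) AS num_null,
COUNT(*) AS TotalRows
FROM employees
UNION ALL
SELECT "lastName" AS column_name,
...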
- Setting up the outer query with the percentage calculation (note that perc_of_null is left as a fraction, mirroring Postgres' null_frac, while perc_of_distinct is scaled to a percentage):
SET @sql2 := CONCAT('SELECT *, CAST(num_null/TotalRows AS DECIMAL(14,7)) perc_of_null,
ROUND((num_of_distinct/TotalRows)*100,2) perc_of_distinct
FROM (',@sql1,') tv;');
/*Check @sql2 variable*/
SELECT @sql2;
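With @sql1 inlined, @sql2 wraps the union in a derived table and is roughly:
SELECT *, CAST(num_null/TotalRows AS DECIMAL(14,7)) perc_of_null,
ROUND((num_of_distinct/TotalRows)*100,2) perc_of_distinct
FROM ( ...the @sql1 union from above... ) tv;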
- Preparing the statement from the @sql2 variable, executing it, then deallocating it:
PREPARE stmt FROM @sql2;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
Result:
column_name | num_of_distinct | num_null | TotalRows | perc_of_null | perc_of_distinct |
---|---|---|---|---|---|
employeeNumber | 23 | 0 | 23 | 0.0000000 | 100.00 |
lastName | 19 | 0 | 23 | 0.0000000 | 82.61 |
firstName | 21 | 0 | 23 | 0.0000000 | 91.30 |
extension | 20 | 0 | 23 | 0.0000000 | 86.96 |
email | 22 | 0 | 23 | 0.0000000 | 95.65 |
officeCode | 7 | 0 | 23 | 0.0000000 | 30.43 |
reportsTo | 6 | 1 | 23 | 0.0434783 | 26.09 |
jobTitle | 7 | 0 | 23 | 0.0000000 | 30.43 |
Here's a demo fiddle.
For the full result matching your screenshot, you can modify the @sql2 variable as follows:
SET @sql2 := CONCAT('SELECT CONCAT(tt.table_schema,".",tt.table_name) AS full_table_name,
cc.COLUMN_NAME,
cc.COLUMN_TYPE,
cc.IS_NULLABLE,
cc.COLUMN_KEY,
tt.TABLE_COMMENT,
cc.EXTRA,
num_of_distinct,
num_null,
num_null/TotalRows perc_of_null,
ROUND((num_of_distinct/TotalRows)*100,2) perc_of_distinct
FROM information_schema.COLUMNS cc
JOIN information_schema.TABLES tt ON cc.TABLE_SCHEMA=tt.TABLE_SCHEMA AND cc.TABLE_NAME=tt.TABLE_NAME
JOIN (',@sql1,') tv ON cc.COLUMN_NAME=tv.column_name
WHERE tt.table_schema=DATABASE() AND tt.table_name="employees";');
However, I'm not entirely sure that the table attributes in MySQL match the ones from Postgres.
Answer:
You can create your own procedure that builds such a report using prepared statements.
DELIMITER //
CREATE PROCEDURE sys_column_stats (IN tableName varchar(64))
BEGIN
  -- GROUP_CONCAT truncates at 1024 characters by default; raise the limit for wide tables
  SET SESSION group_concat_max_len = 1000000;
  SELECT
    GROUP_CONCAT(
      'SELECT \'', c.column_name,'\' AS column_name,',
      '(SELECT COUNT(DISTINCT `', c.column_name,'`) FROM `', tableName, '`) AS distinct_count,',
      '(SELECT SUM(`', c.column_name,'` IS NULL) FROM `', tableName, '`) AS null_count'
      -- ... more stats as needed
      SEPARATOR ' UNION ALL '
    ) AS sql_statement
  INTO @sql
  FROM information_schema.columns c
  WHERE c.table_name = tableName
    AND c.table_schema = DATABASE()  -- avoid picking up same-named tables in other schemas
  ;
  PREPARE stmt FROM @sql;
  EXECUTE stmt;
  DEALLOCATE PREPARE stmt;
END//
DELIMITER ;
Usage:
CALL sys_column_stats('employees');
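With the sample data above, this should return one row per column, along the lines of (abridged; row order may vary):
column_name | distinct_count | null_count
---|---|---
email | 22 | 0
employeeNumber | 23 | 0
reportsTo | 6 | 1
... | ... | ...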
The SQL statement in the procedure prepares the UNION ALL query as:
SELECT
'email' AS column_name,
(SELECT COUNT(DISTINCT `email`) FROM `employees`) AS distinct_count,
(SELECT SUM(`email` IS NULL) FROM `employees`) AS null_count
UNION ALL
SELECT
'employeeNumber' AS column_name,
(SELECT COUNT(DISTINCT `employeeNumber`) FROM `employees`) AS distinct_count,
(SELECT SUM(`employeeNumber` IS NULL) FROM `employees`) AS null_count
UNION ALL
[ ... ]
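Since the question also asks for the null percent, a minimal sketch of an extension to the procedure's GROUP_CONCAT body adds a null fraction per column using AVG(col IS NULL) (a boolean expression evaluates to 1 or 0 in MySQL, so its average is the fraction of NULLs, analogous to Postgres' null_frac):
'(SELECT SUM(`', c.column_name,'` IS NULL) FROM `', tableName, '`) AS null_count,',
'(SELECT AVG(`', c.column_name,'` IS NULL) FROM `', tableName, '`) AS null_frac'
so that each generated block becomes, e.g.:
SELECT
'email' AS column_name,
(SELECT COUNT(DISTINCT `email`) FROM `employees`) AS distinct_count,
(SELECT SUM(`email` IS NULL) FROM `employees`) AS null_count,
(SELECT AVG(`email` IS NULL) FROM `employees`) AS null_frac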