Home > Software design >  How to validate that UTF-8 columns actually save space?
How to validate that UTF-8 columns actually save space?

Time:02-04

SQL Server 2019 introduces support for the widely used UTF-8 character encoding.

I have a large table that stores sent emails. So I'd like to give this feature a try.

ALTER TABLE dbo.EmailMessages 
ALTER COLUMN Body NVARCHAR(MAX) COLLATE Latin1_General_100_CI_AI_SC_UTF8;

ALTER TABLE dbo.EmailMessages  REBUILD;

My concern is that I don't know how to verify size gains. It seems that popular scripts for size estimation do not properly report size in this case.

CodePudding user response:

Basically, column type must be converted to VARCHAR(MAX) then data is stored in a more compact manner:

To limit the amount of changes required for the above scenarios, UTF-8 is enabled in existing the data types CHAR and VARCHAR. String data is automatically encoded to UTF-8 when creating or changing an object’s collation to a collation with the “_UTF8” suffix, for example from LATIN1_GENERAL_100_CI_AS_SC to LATIN1_GENERAL_100_CI_AS_SC_UTF8.

Size can be inspected using sp_spaceused:

sp_spaceused N'EmailMessages';

If unused space is high then you might need to reorganize:

ALTER INDEX ALL ON dbo.EmailMessages REORGANIZE WITH (LOB_COMPACTION = ON);

In my case size was reduced by a factor of ~2 (mostly English text).

CodePudding user response:

You can use a query like the following to compare string lengths. It assumes a table named @Data with a column called String.

SELECT *
FROM @Data
CROSS APPLY (
    SELECT
        CONVERT(VARCHAR(MAX), String COLLATE LATIN1_GENERAL_100_CI_AS_SC_UTF8) AS Utf8String
) U
CROSS APPLY (
    SELECT
        LEN(String) AS Length,
        --LEN(Utf8String) AS Utf8Length,
        LEN(CONVERT(VARBINARY(MAX), String)) AS NVarcharBytes,
        LEN(CONVERT(VARBINARY(MAX), Utf8String)) AS Utf8Bytes
) L
CROSS APPLY (
    SELECT
        CASE WHEN Utf8Bytes < NVarcharBytes THEN 'Yes' ELSE '' END AS IsShorter,
        CASE WHEN Utf8Bytes > NVarcharBytes THEN 'Yes' ELSE '' END AS IsLonger
) C
CROSS APPLY (
    SELECT
        CONVERT(VARCHAR(MAX), CONVERT(VARBINARY(MAX), String), 1) AS NVarcharHex,
        CONVERT(VARCHAR(MAX), CONVERT(VARBINARY(MAX), Utf8String), 1) AS Utf8Hex
) H

You can replace FROM @Data with something like FROM (SELECT Email FROM YourTable) D to query your specific data. Replace SELECT * with SELECT SUM(NVarcharBytes) AS NVarcharBytes, SUM(Utf8Bytes) AS Utf8Bytes to get totals.

See this db<>fiddle.

  • Related