SQL Server 2019 introduces support for the widely used UTF-8 character encoding.
I have a large table that stores sent emails. So I'd like to give this feature a try.
ALTER TABLE dbo.EmailMessages
ALTER COLUMN Body NVARCHAR(MAX) COLLATE Latin1_General_100_CI_AI_SC_UTF8;
ALTER TABLE dbo.EmailMessages REBUILD;
My concern is that I don't know how to verify size gains. It seems that popular scripts for size estimation do not properly report size in this case.
CodePudding user response:
Basically, column type must be converted to VARCHAR(MAX)
then data is stored in a more compact manner:
To limit the amount of changes required for the above scenarios, UTF-8 is enabled in existing the data types CHAR and VARCHAR. String data is automatically encoded to UTF-8 when creating or changing an object’s collation to a collation with the “_UTF8” suffix, for example from LATIN1_GENERAL_100_CI_AS_SC to LATIN1_GENERAL_100_CI_AS_SC_UTF8.
Size can be inspected using sp_spaceused
:
sp_spaceused N'EmailMessages';
If unused space is high then you might need to reorganize
:
ALTER INDEX ALL ON dbo.EmailMessages REORGANIZE WITH (LOB_COMPACTION = ON);
In my case size was reduced by a factor of ~2 (mostly English text).
CodePudding user response:
You can use a query like the following to compare string lengths. It assumes a table named @Data
with a column called String
.
SELECT *
FROM @Data
CROSS APPLY (
SELECT
CONVERT(VARCHAR(MAX), String COLLATE LATIN1_GENERAL_100_CI_AS_SC_UTF8) AS Utf8String
) U
CROSS APPLY (
SELECT
LEN(String) AS Length,
--LEN(Utf8String) AS Utf8Length,
LEN(CONVERT(VARBINARY(MAX), String)) AS NVarcharBytes,
LEN(CONVERT(VARBINARY(MAX), Utf8String)) AS Utf8Bytes
) L
CROSS APPLY (
SELECT
CASE WHEN Utf8Bytes < NVarcharBytes THEN 'Yes' ELSE '' END AS IsShorter,
CASE WHEN Utf8Bytes > NVarcharBytes THEN 'Yes' ELSE '' END AS IsLonger
) C
CROSS APPLY (
SELECT
CONVERT(VARCHAR(MAX), CONVERT(VARBINARY(MAX), String), 1) AS NVarcharHex,
CONVERT(VARCHAR(MAX), CONVERT(VARBINARY(MAX), Utf8String), 1) AS Utf8Hex
) H
You can replace FROM @Data
with something like FROM (SELECT Email FROM YourTable) D
to query your specific data. Replace SELECT *
with SELECT SUM(NVarcharBytes) AS NVarcharBytes, SUM(Utf8Bytes) AS Utf8Bytes
to get totals.
See this db<>fiddle.