Is there any (invertible) way (in c#) to convert a string into a smaller one, and when I say smaller


Let me explain: in my use case a system gives me many strings that vary in size (number of characters, i.e. length), and sometimes they can be really huge. The problem is that I have to save each string in a column of a table in a SQL Server database. The bad news is that I am not allowed to run any migration on this database; the good news is that the column already has type nvarchar(max).

I did some research and followed the post below to write a data compressor using Gzip and Brotli.

https://khalidabuhakmeh.com/compress-strings-with-dotnet-and-csharp

var value = "hello world";
var level = CompressionLevel.SmallestSize;

var bytes = Encoding.Unicode.GetBytes(value);
await using var input = new MemoryStream(bytes);
await using var output = new MemoryStream();

// GZipStream with BrotliStream
await using var stream = new GZipStream(output, level);

await input.CopyToAsync(stream);

var result = output.ToArray();
var resultString = Convert.ToBase64String(result);

After implementing the conversion methods, I wrote tests that generate random strings of varying sizes (lengths) to validate the compressors, and at this point I noticed the following. Both Gzip and Brotli first convert the string to a byte[] and then apply compression, which yields a byte array of reduced size, as expected; but the result (byte[]) is then converted to a Base64 string that, in 100% of my tests, has more characters (a greater length) than the initial string.
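To put numbers on that overhead, here is a quick illustrative sketch (not my real data): Base64 emits 4 characters for every 3 bytes, and Encoding.Unicode produces 2 bytes per character, so the compressed payload has to shrink below roughly three quarters of the original character count before the Base64 string even breaks even.

using System;
using System.Text;

var value = "hello world";                             // any input string
var utf16Bytes = Encoding.Unicode.GetBytes(value);     // 2 bytes per character
var base64Length = 4 * ((utf16Bytes.Length + 2) / 3);  // Base64: 4 chars per 3 bytes
Console.WriteLine($"{value.Length} chars -> {utf16Bytes.Length} bytes -> {base64Length} Base64 chars");
// 11 chars -> 22 bytes -> 32 Base64 chars: even before compression the
// Base64 form is roughly 3x longer than the input string.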

My random string generator:

var rd_char = new Random();
var rd_length = new Random();
var wordLength = rd_length.Next(randomWordParameters.WordMinLength, randomWordParameters.WordMaxLength);
var sb = new StringBuilder();
int sourceNumber;
for (int i = 0; i < wordLength; i++)
{
    sourceNumber = rd_char.Next(randomWordParameters.CharLowerBound, randomWordParameters.CharUpperBound);
    sb.Append(Convert.ToChar(sourceNumber));
}
var word = sb.ToString();

My sample strings aren't a perfect representation of the real data, but I believe they are good enough. The generator method above produces completely random strings within a given size range, and in the tests I used characters from the code-point range 33 to 127 passed to Convert.ToChar(). The strings provided by the system are in JSON format; in practice they are lists of URLs (tens of thousands of URLs), and the URLs usually contain random-looking sequences of characters, so I tried to generate strings that are as random as possible.

The fact is this: consider the case where I try to save a string that, before compression, is already larger than the maximum size (length) allowed for the column. What actually goes into the column is the Base64 string produced after compression, not the reduced-size byte array, and I believe the database will refuse that Base64 string, since its length (in number of characters) is greater than the length of the original string.

So here's my question: is there any (invertible) way to convert a string into a smaller one, and when I say smaller I mean "with reduced length"? It seems that neither Gzip nor Brotli solves the problem.

PS: I made sure to highlight the term "length" several times to make it clear that I'm talking about the number of characters and not the size in memory, because in several forums I read earlier this confusion was making it difficult to reach conclusions.

CodePudding user response:

The max size for a column of type NVARCHAR(MAX) is 2 GByte of storage.

Since NVARCHAR uses 2 bytes per character, that's approx. 1 billion characters.
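A quick back-of-the-envelope check (a sketch only; the exact usable size depends on storage overhead):

long maxBytes = (1L << 31) - 1;       // NVARCHAR(MAX) limit: 2,147,483,647 bytes
long maxChars = maxBytes / 2;         // 2 bytes per UTF-16 code unit
Console.WriteLine($"{maxChars:N0}");  // ≈ 1,073,741,823 characters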

So I don't think you actually need compression. If the problem is performance when retrieving the data, you can use a server-side caching system instead.

CodePudding user response:

Compression algorithms exploit repetitive patterns in the input stream. There is not much repetition in a typical URL, so compressing a single URL is unlikely to yield a representation that is much shorter than the original. If the URL has no repetitive patterns at all (if it's close to a random string), the compression algorithm will produce an output that is larger than the input.

Below is a demonstration of this behavior, using Encoding.UTF8 to convert the URLs to bytes, and Encoding.Latin1 to convert the compressed bytes back to a string:

static string Compress(string value)
{
    byte[] bytes = Encoding.UTF8.GetBytes(value);
    using var input = new MemoryStream(bytes);
    using var output = new MemoryStream();
    using (var gz = new GZipStream(output, CompressionLevel.SmallestSize))
        input.CopyTo(gz);
    byte[] result = output.ToArray();
    return Encoding.Latin1.GetString(result);
}

static string Decompress(string compressedValue)
{
    byte[] bytes = Encoding.Latin1.GetBytes(compressedValue);
    using var input = new MemoryStream(bytes);
    using var output = new MemoryStream();
    using (var gz = new GZipStream(input, CompressionMode.Decompress))
        gz.CopyTo(output);
    byte[] result = output.ToArray();
    return Encoding.UTF8.GetString(result);
}

I used three fairly long and non-repetitive URLs for the test:

string[] urls = new string[]
{
    "https://stackoverflow.com/questions/71884821/is-there-any-invertible-way-in-c-to-convert-a-string-into-a-smaller-one-an#comment127033258_71884821",
    "https://github.com/dotnet/runtime/blob/2d4f2d0c8f60d5f49e39f3ddbe1824648ee2b306/src/libraries/System.Private.CoreLib/src/System/Text/Encoding.cs#L77",
    "https://sharplab.io/#v2:CYLg1APgAgTAjAWAFBQMwAJabgdmQb2XWMwygBZ0BZAQwEsA7ACgEoiTCkTvsBOJgEQAJAKYAbMQHt0Ad0kAnMcAEsA3O2IBfZJqA===",
};
foreach (var original in urls)
{
    Console.WriteLine($"Original:     {original.Length} chars, {original.Substring(0, 50)}...");
    var compressed = Compress(original);
    double compression = (original.Length - compressed.Length) / (double)original.Length;
    Console.WriteLine($"Compressed:   {compressed.Length} chars, compression: {compression:0.00%}");
    var decompressed = Decompress(compressed);
    Console.WriteLine($"Decompressed: {decompressed.Length} chars");
    Console.WriteLine($"Successful:   {decompressed == original}");
    Console.WriteLine();
}

Output:

Original:     145 chars, https://stackoverflow.com/questions/71884821/is-th...
Compressed:   133 chars, compression: 8.28%
Decompressed: 145 chars
Successful:   True

Original:     148 chars, https://github.com/dotnet/runtime/blob/2d4f2d0c8f6...
Compressed:   143 chars, compression: 3.38%
Decompressed: 148 chars
Successful:   True

Original:     128 chars, https://sharplab.io/#v2:CYLg1APgAgTAjAWAFBQMwAJabg...
Compressed:   141 chars, compression: -10.16%
Decompressed: 128 chars
Successful:   True

Try it on Fiddle.

Two of the three URLs became slightly shorter after the compression, but the third URL was bloated instead.

You could store in the database either the compressed or the original value, depending on which is the shorter one. You could prefix the stored value with some marker, for example 'C' or 'U', so that you know if it's compressed or uncompressed.
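A minimal sketch of that marker idea, reusing the Compress/Decompress methods above (the Pack/Unpack names and the 'C'/'U' markers are just illustrative choices):

static string Pack(string value)
{
    var compressed = Compress(value);
    return compressed.Length < value.Length
        ? "C" + compressed   // compressed form is shorter, store it
        : "U" + value;       // compression bloated the value, store the original
}

static string Unpack(string stored)
{
    var payload = stored.Substring(1);   // strip the marker character
    return stored[0] == 'C' ? Decompress(payload) : payload;
}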
