I am trying to compare MongoDB (latest from the git repo) compression rates of snappy, zstd, etc. Here is the relevant snippet from my /etc/mongod.conf:
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
  wiredTiger:
    engineConfig:
      journalCompressor: snappy
    collectionConfig:
      blockCompressor: snappy
    indexConfig:
      prefixCompression: true
My test case inserts entries into a collection. Each document has an _id and 1 MB of binary data, randomly generated using faker (a simplified sketch of the insert loop follows the numbers below). I inserted 5 GB and 7 GB of data, but the storage size does not appear to be compressed at all. The AWS instance hosting the MongoDB server has 15 GB of memory and 100 GB of disk space. Here is example data collected from dbStats:
5GB data:
{'Data Size': 5243170000.0,
'Index Size': 495616.0,
'Storage size': 5265686528.0,
'Total Size': 5266182144.0}
7GB data:
{'Data Size': 7340438000.0,
'Index Size': 692224.0,
'Storage size': 7294259200.0,
'Total Size': 7294951424.0}
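Roughly, the insert loop looks like this (simplified; the database, collection, and field names are placeholders):

from pymongo import MongoClient
from faker import Faker

fake = Faker()
client = MongoClient()
coll = client.testdb.compression_test  # placeholder names

# Insert ~5 GB: each document carries 1 MB of random binary data.
for _ in range(5 * 1024):
    coll.insert_one({"payload": fake.binary(length=1024 * 1024)})

print(client.testdb.command("dbStats"))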
Is there something wrong with my config? Or does compression not kick in until the data size is substantially larger than the memory size, or the available storage size? What am I missing here?
Thanks a ton for your help.
CodePudding user response:
Compression algorithms work by identifying repeating patterns in the data, and replacing those patterns with identifiers that are significantly smaller.
Unless the random number generator is a very bad one, random data doesn't have any patterns and so doesn't compress well.
A quick demonstration:
~% dd if=/dev/urandom bs=1024 count=1024 of=rand.dat
1024+0 records in
1024+0 records out
1048576 bytes transferred in 0.011312 secs (92695833 bytes/sec)
~% ls -l rand.dat
-rw-r--r-- 1 user group 1048576 Oct 22 18:22 rand.dat
~% gzip -9 rand.dat
~% ls -l rand.dat.gz
-rw-r--r-- 1 user group 1048923 Oct 22 18:22 rand.dat.gz
This shows that even gzip at its best/slowest compression setting produces a "compressed" file that is actually slightly larger than the original.
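The same effect is easy to reproduce from Python, which may be closer to your test setup (a quick sketch using the standard zlib module; exact sizes will vary from run to run):

import os
import zlib

random_payload = os.urandom(1024 * 1024)             # 1 MB of random bytes
repetitive_payload = b"abcdef" * (1024 * 1024 // 6)  # ~1 MB of a repeating pattern

# Random data: the "compressed" output is slightly larger than the input.
print(len(zlib.compress(random_payload, 9)))

# Repetitive data: shrinks to a few kilobytes.
print(len(zlib.compress(repetitive_payload, 9)))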
You might try using a random object generator to create the documents, like this one: https://stackoverflow.com/a/2443944/2282634
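Since you already have faker installed, another option is to fill the documents with structured, text-heavy fields instead of raw random bytes; a rough sketch of the idea (the field choices here are just an example):

from faker import Faker

fake = Faker()

def make_doc():
    # Natural-language text and repeated field names contain plenty of
    # patterns, so snappy/zstd can actually compress the resulting blocks.
    return {
        "name": fake.name(),
        "address": fake.address(),
        "bio": fake.paragraph(nb_sentences=50),
    }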
CodePudding user response:
As Joe has already explained and demonstrated, you cannot compress random data; that is a mathematical fact.
If you would like to test with real data, visit one of the "open data" projects, for example https://ourworldindata.org/
These data sets are often also provided in JSON or CSV format, so you can import them easily.
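For example, a CSV export can be loaded in a few lines of Python (a sketch; the file, database, and collection names are placeholders, and every value is imported as a string):

import csv
from pymongo import MongoClient

coll = MongoClient().testdb.owid  # placeholder names

with open("owid-data.csv", newline="") as f:
    coll.insert_many(csv.DictReader(f))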