MongoDB Snappy compression: Data Size vs Storage Size

I am trying to compare MongoDB (latest from the git repo) compression rates for snappy, zstd, etc. Here is the relevant snippet from my /etc/mongod.conf:

storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
  wiredTiger:
    engineConfig:
      journalCompressor: snappy
    collectionConfig:
      blockCompressor: snappy
    indexConfig:
      prefixCompression: true

My test case inserts entries into a collection. Each entry has an _id and 1MB of binary data, randomly generated using Faker. I insert 5GB/7GB of data, but the storage size does not appear to be compressed. The AWS instance hosting MongoDB has 15GB of memory and 100GB of disk space. Here is example data collected from dbStats:

5GB data:

{'Data Size': 5243170000.0,
 'Index Size': 495616.0,
 'Storage size': 5265686528.0,
 'Total Size': 5266182144.0}

7GB data:

{'Data Size': 7340438000.0,
 'Index Size': 692224.0,
 'Storage size': 7294259200.0,
 'Total Size': 7294951424.0}
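
For reference, my insert loop is roughly equivalent to this simplified sketch (pymongo + Faker; the database and collection names here are just placeholders):

# Insert ~5GB of 1MB random-binary documents, then report dbStats.
from bson.binary import Binary
from faker import Faker
from pymongo import MongoClient

fake = Faker()
client = MongoClient("mongodb://localhost:27017")
coll = client["compression_test"]["blobs"]

ONE_MB = 1024 * 1024
for _ in range(5 * 1024):  # roughly 5GB of payload
    coll.insert_one({"payload": Binary(fake.binary(length=ONE_MB))})

stats = client["compression_test"].command("dbStats")
print({"Data Size": stats["dataSize"],
       "Index Size": stats["indexSize"],
       "Storage size": stats["storageSize"],
       "Total Size": stats["totalSize"]})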

Is there something wrong with my config? Or does compression not kick in until the data size is substantially larger than the memory size, or the available storage size? What am I missing here?

Thanks a ton for your help.

CodePudding user response:

Compression algorithms work by identifying repeating patterns in the data, and replacing those patterns with identifiers that are significantly smaller.

Unless the random number generator is a very bad one, random data doesn't have any patterns and so doesn't compress well.

A quick demonstration:

~% dd if=/dev/urandom bs=1024 count=1024 of=rand.dat
1024+0 records in
1024+0 records out
1048576 bytes transferred in 0.011312 secs (92695833 bytes/sec)

~% ls -l rand.dat
-rw-r--r--  1 user  group  1048576 Oct 22 18:22 rand.dat

~% gzip -9 rand.dat

~% ls -l rand.dat.gz
-rw-r--r--  1 user  group  1048923 Oct 22 18:22 rand.dat.gz

This shows that even gzip with its best (and slowest) compression setting produces a "compressed" file that is actually bigger than the original.
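
The same thing can be shown in a few lines of Python with the standard-library zlib module (DEFLATE, the same family as gzip; the principle applies equally to snappy and zstd):

# Compare how well random vs. repetitive data compresses.
import os
import zlib

random_blob = os.urandom(1024 * 1024)                # 1MB of random bytes
repetitive_blob = b"mongodb " * (1024 * 1024 // 8)   # 1MB of a repeating pattern

for name, blob in (("random", random_blob), ("repetitive", repetitive_blob)):
    compressed = zlib.compress(blob, 9)              # level 9 = best/slowest
    print(f"{name:>10}: {len(blob)} -> {len(compressed)} bytes")

The random blob comes out slightly larger than the input, while the repetitive one shrinks to a tiny fraction of its size.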

You might try using a random object generator to create the documents, like this one: https://stackoverflow.com/a/2443944/2282634
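
As a rough sketch of what that could look like with Faker (the field choices here are just an example), structured documents with natural-language text give the block compressor real patterns to work with:

# Insert realistic, compressible documents instead of random binary blobs.
from faker import Faker
from pymongo import MongoClient

fake = Faker()
db = MongoClient("mongodb://localhost:27017")["compression_test"]
coll = db["profiles"]

docs = [{"name": fake.name(),
         "address": fake.address(),
         "email": fake.email(),
         "bio": fake.paragraph(nb_sentences=10)}
        for _ in range(10000)]
coll.insert_many(docs)

stats = db.command("dbStats")
# With snappy enabled, storageSize should now come out well below dataSize.
print(stats["dataSize"], stats["storageSize"])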

CodePudding user response:

As Joe already explained and demonstrated, you cannot compress random data; that is a mathematical fact.

If you want to test with real data, visit one of the "open data" projects, for example https://ourworldindata.org/

These data sets are often also provided in JSON or CSV format, so you can easily import them.
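
If you go that route, a few lines of pymongo are enough to load a downloaded CSV (sketch only; the file name and collection are placeholders):

# Import a downloaded CSV data set into MongoDB for a compression test.
import csv
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["compression_test"]["owid"]

with open("owid-dataset.csv", newline="") as f:
    # csv.DictReader yields one dict per row; pymongo stores each as a document.
    coll.insert_many(csv.DictReader(f))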
