I am trying to compare MongoDB (latest from the git repo) compression rates of snappy, zstd, etc. Here is the relevant snippet from my /etc/mongod.conf:
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
  wiredTiger:
    engineConfig:
      journalCompressor: snappy
    collectionConfig:
      blockCompressor: snappy
    indexConfig:
      prefixCompression: true
My test case inserts entries into a collection. Each document has an _id and 1 MB of binary data, randomly generated using faker (a simplified sketch of the insert loop follows the numbers below). I inserted 5 GB and 7 GB of data, but the storage size does not appear to be compressed at all. The AWS instance hosting the MongoDB server has 15 GB of memory and 100 GB of disk space. Here is example data collected from dbStats:
5GB data:
{'Data Size': 5243170000.0,
'Index Size': 495616.0,
'Storage size': 5265686528.0,
'Total Size': 5266182144.0}
7GB data:
{'Data Size': 7340438000.0,
'Index Size': 692224.0,
'Storage size': 7294259200.0,
'Total Size': 7294951424.0}
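Roughly, the insert loop looks like this (simplified; the database, collection, and field names are placeholders):

from pymongo import MongoClient
from faker import Faker

fake = Faker()
client = MongoClient()
coll = client.testdb.compression_test  # placeholder names

# Insert ~5 GB: each document carries 1 MB of random binary data.
for _ in range(5 * 1024):
    coll.insert_one({"payload": fake.binary(length=1024 * 1024)})

print(client.testdb.command("dbStats"))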
Is there something wrong with my config? Or does compression not kick in until the data size is substantially larger than the memory size, or the available storage size? What am I missing here?
Thanks a ton for your help.
CodePudding user response:
Compression algorithms work by identifying repeating patterns in the data, and replacing those patterns with identifiers that are significantly smaller.
Unless the random number generator is a very bad one, random data doesn't have any patterns and so doesn't compress well.
A quick demonstration:
~% dd if=/dev/urandom bs=1024 count=1024 of=rand.dat
1024+0 records in
1024+0 records out
1048576 bytes transferred in 0.011312 secs (92695833 bytes/sec)
~% ls -l rand.dat
-rw-r--r-- 1 user group 1048576 Oct 22 18:22 rand.dat
~% gzip -9 rand.dat
~% ls -l rand.dat.gz
-rw-r--r-- 1 user group 1048923 Oct 22 18:22 rand.dat.gz
This shows that even gzip at its best/slowest compression setting produces a "compressed" file that is actually slightly larger than the original.
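The same effect is easy to reproduce from Python, which may be closer to your test setup (a quick sketch using the standard zlib module; exact sizes will vary from run to run):

import os
import zlib

random_payload = os.urandom(1024 * 1024)             # 1 MB of random bytes
repetitive_payload = b"abcdef" * (1024 * 1024 // 6)  # ~1 MB of a repeating pattern

# Random data: the "compressed" output is slightly larger than the input.
print(len(zlib.compress(random_payload, 9)))

# Repetitive data: shrinks to a few kilobytes.
print(len(zlib.compress(repetitive_payload, 9)))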
You might try using a random object generator to create the documents, like this one: https://stackoverflow.com/a/2443944/2282634
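Since you already have faker installed, another option is to fill the documents with structured, text-heavy fields instead of raw random bytes; a rough sketch of the idea (the field choices here are just an example):

from faker import Faker

fake = Faker()

def make_doc():
    # Natural-language text and repeated field names contain plenty of
    # patterns, so snappy/zstd can actually compress the resulting blocks.
    return {
        "name": fake.name(),
        "address": fake.address(),
        "bio": fake.paragraph(nb_sentences=50),
    }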
CodePudding user response:
As Joe has already explained and demonstrated, you cannot compress random data; that is a mathematical fact.
If you would like to test with real data, visit one of the "open data" projects, for example https://ourworldindata.org/
These data sets are often also provided in JSON or CSV format, so you can import them easily.
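For example, a CSV export can be loaded in a few lines of Python (a sketch; the file, database, and collection names are placeholders, and every value is imported as a string):

import csv
from pymongo import MongoClient

coll = MongoClient().testdb.owid  # placeholder names

with open("owid-data.csv", newline="") as f:
    coll.insert_many(csv.DictReader(f))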