Home > Net >  How to store number in shortest possible size?
How to store number in shortest possible size?

Time:12-19

I will be adding comments into my website and I want to have a tree structure so each comment can have a parent. This creates a problem when retrieving comments because each comment would have to be traversed for children and this is unacceptable when it comes to database storage.

There are various solutions to this well known performance problem and I want to use simple, and fast, path prefix scan(SELECT * FROM comments WHERE path LIKE 'commentID/%') where I will have index with path to each comment via all its parents and the comment's id as value.

In practice it would look like:

[comment id]/[comment id]/[comment id]/.. = [comment id]

This way I can find all comments for a certain path at once and also delete children at once when needed.

The comment id will be unsigned integer, asigned by the databse as auto-incrementing key.

The problem I am looking at is how to have the shortest possible representation of the comment id so the path index will not be too long.

I expect to use unsigned 64bit long integer which takes up 8 bytes in binary format. In case of three predecessors, the path value wil be 24 bytes 2 bytes for delimiter. If I would use a string representation, then it depends on the size of the numbers. In case each parent's id is 5 digits long(12345), the path value would be 17 bytes long and it would be more efficient than binary representation of the numbers by 9 bytes. But once the ids go above 8 digits, the string variant becomes less efficient, it is longer, than the binary format.

So my question is if there is another way to have short representation of an integer? I was thinking in line of bijective function.

CodePudding user response:

One obvious answer is to use a 32-bit integer instead of a 64-bit integer. Do you really think there will be more the 232 comments on your website? No offense intended, but I assume you are not implementing Facebook. You're not even implementing Stack Overflow, which only has 83 million comments so far.

If you really did need to support 264 distinct values, I don't see how a bijective function would help. The values in both sets related in the bijection would need a minimum of 64 bits each, and you wouldn't save anything.

Another answer is to change the way you store the hierarchical data, so you don't have to worry about the length of the comment path. You're using one way of managing tree-like data, which is sometimes called Materialized Path, and this naturally imposes a limit on the depth of the path, which is the point of your question.

A way to do this without Materialized Path is to just use the normal way of storing hierarchy: each comment has a reference to its own parent. Then you need to learn to use recursive SQL queries, which are supported in MySQL 8.0.

Or you could use other alternatives for storing the hierarchy. I like to use a Closure Table, see my answer to What is the most efficient/elegant way to parse a flat table into a tree?

Which design is best for your app depends on the kinds of queries you will need to make against these data. You'll have to experiment with each solution and see how well it works for the uses you will need to do, such as adding a comment, searching for a thread of comments, deleting a comment or a thread, etc.

CodePudding user response:

I would expect that 98% "paths" would be no more than 3 deep: "123/456/789". So, by using VARCHAR, that would take 2 11 bytes (2 for length; 11 for the string). In other words, I don't see space as a big issue.

I looked at a forum with 71K Questions. There were 185K Responses. Ignoring the "threading" (responses to responses), that says that 71K/185K (38%) were a single number (a la "123"). I do no have statistics on threading, but my 98% claim above, may be reasonably close to correct.

Bottom Line: I think that shrinking the string representing a "thread" is a minor issue.

(I have a Rule of Thumb: If a proposed change does not seem to save 10%, move on to other issues.) The thread string is probably much less than 10% of the disk space taken by all the other stuff -- especially the text. In my investigation, the "meta info" was 11MB; the text was 260MB. Your thread string would be a small part of the 11MB.

For saving space, I would compress() the text (the 260MB above) and store into a MEDIUMBLOB. Prose (or code or XML or json) typically shrinks by 3:1 (making that only 87MB). Do the compress/uncompress in the client, thereby offloading the server and the network.

On the other hand, you probably want a FULLTEXT index on the comments. So it can't be compressed.

  • Related