Home > Net >  Generating short strings (letters and numbers) with lowest length possible (for url shortner): Try r
Generating short strings (letters and numbers) with lowest length possible (for url shortner): Try r

Time:09-02

I am developing a url shortner and I would like to know your opinion on this matter.

I want to generate shortlinks, for the shortlinks I need shortstrings. For the shortstrings I want only letters and numbers, and starting with the length of 3 I want the users to get the smallest possible string as long as it is available.

If I allow only letters and numbers, for strings with the length of 3, we can have 46656 possibilities.

I've though about 2 paths:

  1. Generating random string --> Checking for availability (does not exist in shortlinks table - meaning not yet used) --> If not yet in use we use it --> If already in use, the system needs to generate another random string and check again for availability, and repeat until we find one that is available. Considering that we have 46656 possibilities for the length of 3, eventually, as the shortstrings/links are taken, the system would have to retry the random and recheck for availability more times ( meaning more queries, meaning waiting depending on the amount of conflicts ), also I would need to know when to start generating strings with the length of 4 ( when all the 46656 possibilities are taken ). Considering my table can have an 'unique' flag for the shortstrings/links, I would only have to check if totals are >= 46656, to start generating strings with the length of 4, and then after that >= (46656 the total possibilities with the length of 4), to start generating strings with the length of 5.. and so on.. ( some math )..

  2. Seeding the database on project install / post project install as needed: I can have a 'shortstrings' table, with 'shortstring' and 'is_available' for example, and have a seeder seed this data (aaa, aab, aac...) with the is_available=1, I can initially decide to seed all the possibilities with the length of 3, or also 4, etc. (So either having alot of available links right from the start, or as needed: seed more). This way when generating a shortlink, I would just have to find() the next is_available=1 shortstring, and when used set it's is_available to 0 ( no retries, not generating randoms and possibly having conflicts ). If I don't want to manually monitor / seed the available links, I can have the system check for how many available links are left and if the number is low tell it to seed for the next length ( and this would make alot more links available , since all possibilities with the length of 4 are way more than with the lenght of 3)

I believe the second option is more performant, and it is what I am going for ( I have already set it up ) but I would like to know more opinions on this matter.

Just in case you want to know, I am using php/laravel/mariadb/mysql for this case scenario.

Thanks in advance.

CodePudding user response:

In general terms, you have two workflows. The one in which you receive an url, generate a short identifier, store it and present it to the user. The other one, when you translate your identifers back to their original URL. There's not much else to a generic link shortener, so your problem boils down to figuring out the alphanumeric identifier keeping it short, and ensuring its uniqueness. It's all a tradeoff between efficiency and shortness, the latter being the most valuable, in your eyes.

I'd say the bottleneck in this case is the roundtrip to the database, so the less queries you perform, the better.


Your first approach is to retrieve every used identifier, then generate random strings until you come up with one that's available, at which point you perform the insertion. As you approach depletion and you find yourself performing tens of thousands of iterations (incurring in a potentially serious race condition) you plan to bump the identifier length to 4 and so on, This means you need to perform 3 queries for each incoming link:

  • retrieving the used identifers
  • retrieving the current length
  • performing the actual insertion

Perhaps you wouldn't really store the current length in the DDBB, you could use an environment variable and read it statically from the config, but this still needs for human supervision to decide the right time to bump the length.


Your second approach would be to generate sequential identifers. This sounds more efficient, indeed, but leaves you open for enumeration from malicious users. This might seem irrelevant for your use case, but you shouldn't make assumptions in that regard from your end user's point of view. If you go for that approach, you still need to perform two queries

  • retrieving the last used identifier
  • storing the current link

I would humbly encourage you to take a look at at Hashids. It's a deterministic, reversible algorithm to encode numbers to alphanumeric strings. There are implementations for PHP, Ruby, Java, Python, Javascript and most well known languages (including PostgreSQL and PLpgSQL).

Starting from a salt that you hard code in your environment (for example) the hashes are unique to your app, meaning a third party wouldn't be able to compute the original ID without knowing it. The identifers are url friendly and curse words are avoided by design.

However, the shortest identifier would be of length 8. If you're willing to compromise on the length, rest assured that 8 characters are enough to encode roughly 300 billions of links. This approach needs for just one query: the actual insertion, after which you grab the ID and encode it to the final hash.

There are other approaches, of course, Encoding the PK to base64 is much cheaper, but the outcome's length is unactractive (you need 16 characters to encode 300 billions of links), plus it's still open to enumeration. UUIDs are extremely secure and unique, but their length is even more unnactractive.

CodePudding user response:

  1. create a table with id --> URL; the id will be AUTO_INCREMENT, so it will start with 1 and continue 'forever'. (See below)
  2. When inserting a new URL, get the id
  3. Convert the id to hex (4 bits per byte) or base64 (6 bits per byte). The first 4095 URLs could be encoded in 3 hex characters or 2 base64 characters. The first 16 million: 6 hex or 4 base64.

Note: Hackers will quickly discover the pattern and can pull up all your URLs. If you need the 'short' URLs to be more cryptic, you will have to sacrifice "short" some.

CREATE TABLE Urls (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    url VARCHAR(2000) NOT NULL,
    PRIMARY KEY(id),
    UNIQUE(url)
) ENGINE=InnoDB;

INSERT INTO Urls (url)
    VALUES (?)
    ON DUPLICATE KEY UPDATE
        id = LAST_INSERT_ID(id);
SELECT LAST_INSERT_ID();

(Caution: I may not have the syntax exactly correct.)

The IODKU (upsert) will either add a new row or discover the existing id if the URL already exists. The LAST_INSERT_ID is a kludge buried deep in the ref manual.

  • Related