Home > Enterprise >  Is it better to generate an identifier in the code or in the database?
Is it better to generate an identifier in the code or in the database?

Time:10-15

I have a relatively simple question; however I am very curious what the convention is and what the reason is for such a convention. The database is PostgreSQL and programming language that I am using is Python, but this does not lie at the core of my question.

Suppose we have the following JSON datastructure, which we still need to parse.

{
  "harry": {
    "transactions": {
      "desc": ["fish", "drinks", "potatoes"],
      "amount": [32, 12, 35]
    },
    "country": "UK"
  },
  "james": {
    "transactions": {
      "desc": ["computer", "water", "table", "phone"],
      "amount": [100, 32, 59, 99]
    },
    "country": "China"
  }
}

and we would like to put this in a PostgreSQL database. I am inclined to create UUIDs for the persons "harry" and "james" and also some UUIDs for their transactions and then insert them into the database. Resulting in three tables: personal_info, trans and pi_trans (which links the two tables).

However, one could also argue that I let the PostgreSQL database generate an identifier (that increases by 1 after every insert) and then I populate the pi_trans table with identifiers I retrieve from PostgreSQL.

I think perhaps the latter approach is way too slow, but other than that, I do not see any other clear reason why we should not do the second appraoch.

In addition, the second approach does allow for unique identifiers if we insert new records, while the former approach -can- have id collisions (although I suspect that with UUIDs the chance is really small).

Could someone help me out figuring what approach one should use and when?

CodePudding user response:

Both identity columns (sequence generated values) and UUIDs will work for you. The collision probability for UUIDs is so small that you can safely bet on it (it is more likely that cosmic rays hit your memory and corrupt your sequence in a way that it returns duplicate values, or indeed that your database and all its backups are destroyed by a solar flare).

You can read my ruminations on the subject sequences vs. UUIDs here; my opinion is that sequences are usually preferable.

If you use INSERT ... RETURNING id, you can retrieve the auto-generated identifier, and you can reuse it with CTEs:

WITH pi AS (
   INSERT INTO personal_info (...) VALUES (...)
   RETURNING id
), t AS (
   INSERT INTO trans (...) VALUES (...)
   RETURNING id
)
INSERT INTO pi_trans (pi_id, trans_id)
SELECT pi.id, t.id
FROM pi CROSS JOIN t;
  • Related