Unnest vs just having every row needed in table


I have a choice in how a data table is created and am wondering which approach is more performant.

  1. Making a table with a row for every data point,
  2. Making a table with an array column that will allow repeated content to be unnested

That is, if I have the data:

day   val1   val2
Mon   7      11
Tue   7      11
Wed   8      9
Thu   1      4

Is it better to enter the data in as shown, or instead:

day         val1   val2
(Mon,Tue)   7      11
(Wed)       8      9
(Thu)       1      4

And then use unnest() to explode those into unique rows when I need them?

Assume we're talking about large data in reality: 100k rows generated every day x 20 columns. Using the array would greatly reduce the number of rows in the table, but I'm concerned that unnest() would be less performant than just having all of the rows.
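That is, with option 2 I'd retrieve the data with something like this (illustrative names data_arr / days, not an existing schema):

-- explode the array column back into one row per day
SELECT unnest(days) AS day, val1, val2
FROM   data_arr;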

CodePudding user response:

"100k rows of data generated every day x 20 columns"
And:
"the array would greatly reduce the number of rows" - so lots of duplicates.

Based on this I would suggest a third option:
Create a table with your 20 columns of data and add a surrogate bigint PK to it. To enforce uniqueness across all 20 columns, add a generated hash and make it UNIQUE. I suggest a custom function for the purpose:

-- hash function: md5 over the row's text representation, cast to uuid
CREATE OR REPLACE FUNCTION public.f_uniq_hash20(col1 text, col2 text, ... , col20 text)
  RETURNS uuid
  LANGUAGE sql IMMUTABLE COST 30 PARALLEL SAFE AS 
'SELECT md5(textin(record_out(($1,$2, ... ,$20))))::uuid';
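To make the record_out() trick concrete, here is a reduced, self-contained 3-column variant (hypothetical name f_uniq_hash3) that runs as-is:

CREATE OR REPLACE FUNCTION public.f_uniq_hash3(col1 text, col2 text, col3 text)
  RETURNS uuid
  LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'SELECT md5(textin(record_out(($1, $2, $3))))::uuid';

SELECT public.f_uniq_hash3('foo', 'bam', 'baz');  -- same inputs always yield the same uuid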

-- data table
CREATE TABLE data (
  data_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, col1 text
, col2 text
, ...
, col20 text
, uniq_hash uuid GENERATED ALWAYS AS (public.f_uniq_hash20(col1, col2, ... , col20)) STORED
, CONSTRAINT data_uniq_hash_uni UNIQUE (uniq_hash)   
);

-- reference data_id in next table
CREATE TABLE day_data (
  day text 
, data_id bigint REFERENCES data ON UPDATE CASCADE -- FK to enforce referential integrity
, PRIMARY KEY (day, data_id)      -- must be unique?
);
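To read the data back in the original wide shape, join the two tables (a sketch; only the first columns are spelled out):

-- reassemble one row per day
SELECT dd.day, d.col1, d.col2  -- ... , d.col20
FROM   day_data dd
JOIN   data d USING (data_id);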


With only text columns, the function is actually IMMUTABLE (which we need!). For other data types (like timestamptz) it would not be.
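To illustrate why: the text representation of a timestamptz depends on the session's TimeZone setting, so the same value can produce different text (and hence a different hash). A runnable demo:

SET timezone = 'UTC';
SELECT textin(record_out(ROW(timestamptz '2024-01-01 00:00+00')));
-- ("2024-01-01 00:00:00+00")

SET timezone = 'America/New_York';
SELECT textin(record_out(ROW(timestamptz '2024-01-01 00:00+00')));
-- ("2023-12-31 19:00:00-05")  -- same instant, different text, different md5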

You could use uniq_hash as PK directly, but for many references, a bigint is more efficient (8 vs. 16 bytes).

Basic technique to avoid duplicates while inserting new data:

INSERT INTO data (col1, col2)   -- remaining columns just default to NULL here
VALUES ('foo', 'bam')
ON CONFLICT DO NOTHING          -- skip rows whose uniq_hash already exists
RETURNING *;                    -- returns only rows that were actually inserted
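Since RETURNING reports nothing for skipped duplicates, a common follow-up is to fall back to a lookup by hash. A sketch, reusing the "..." shorthand above for the full 20 inputs:

WITH ins AS (
   INSERT INTO data (col1, col2)
   VALUES ('foo', 'bam')
   ON CONFLICT DO NOTHING
   RETURNING data_id
)
SELECT data_id FROM ins          -- the row we just inserted ...
UNION  ALL
SELECT data_id FROM data         -- ... or the pre-existing duplicate
WHERE  uniq_hash = public.f_uniq_hash20('foo', 'bam', ... )
LIMIT  1;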

If there can be concurrent writes, this needs more care: the CTE above can miss a row that another transaction inserts at the same moment.

CodePudding user response:

I believe making a table with a row for every data point is the option I would go for, as unnest() on large amounts of data would be just as slow, if not slower. Plus, unless your data is very repetitive, 20 columns is a lot to align.
