Fastest way to load millions of "dummy" records into a PostgreSQL database to mimic PROD data

Time:11-06

I have millions of records in a production database (maybe 10 million across 5 tables) that I want to replicate into another database using only dummy data, since I can't export the real data: it's a PROD DB and contains employee info.

So far I've been inserting dummy data using a LOOP and this query:

DO $$
DECLARE
  table1_id uuid;
  table2_id uuid;
  table3_id uuid;
BEGIN
  FOR counter IN 1..500000 LOOP

    INSERT INTO table1
    VALUES ('random', 'data', 'here')
    RETURNING id INTO table1_id;

    INSERT INTO table2
    VALUES ('random', 'data', 'here')
    RETURNING id INTO table2_id;

    INSERT INTO table3
    VALUES ('random', 'data', 'here')
    RETURNING id INTO table3_id;

    INSERT INTO table4
    VALUES ('random', 'data', 'here');

    INSERT INTO table5
    VALUES ('random', 'data', 'here');

  END LOOP;
END; $$;

I know that inserting single records in a loop is super slow, but how can I do this in batches? Also, is there a better method than loading this in manually? As I stated before, it isn't realistic for me to export the PROD data and use it here.

The reason I want to do this is that I want to query this freshly loaded DB with one of my APIs and test the latency of each query, but I want the record count to be similar to PROD's to keep some type of controlled environment, unless someone can suggest a better way of doing this. Thanks

CodePudding user response:

You can do everything in a single statement, which is most probably the fastest way to do it:

WITH t1 AS (
  INSERT INTO table1
  SELECT 'random', 'data', 'here'
  FROM generate_series(1, 50000)
  RETURNING id
), t2 AS (
  INSERT INTO table2
  SELECT id, 'random', 'data', 'here'
  FROM t1
  RETURNING id
), t3 AS (
  INSERT INTO table3
  SELECT id, 'random', 'data', 'here'
  FROM t2
  RETURNING id
), t4 AS (
  INSERT INTO table4
  VALUES ('random', 'data', 'here')
)
INSERT INTO table5
VALUES ('random', 'data', 'here');
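If each row should carry distinct values rather than identical literals, the series counter can be folded into the inserted columns. A minimal sketch, assuming hypothetical column names (name, payload, table1_id, note) that are not from the original schema:

-- Sketch: vary per-row values using the generate_series counter.
-- Table and column names here are placeholders for illustration.
WITH t1 AS (
  INSERT INTO table1 (name, payload)
  SELECT 'user-' || g, md5(g::text)
  FROM generate_series(1, 500000) AS g
  RETURNING id
)
INSERT INTO table2 (table1_id, note)
SELECT id, 'child-row'
FROM t1;

Because the whole thing is one statement, it also runs in a single transaction, avoiding the per-iteration overhead of the PL/pgSQL loop.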

CodePudding user response:

Those queries generate arbitrary data rather than random data. Generating random data with the right properties to test realistically is a bit of an art (not to mention a field of study in its own right).

For random numbers, you can use random(), combined with arithmetic to scale it and floor() to convert it to an integer if needed. If you need non-uniform distributions, you can also transform it with sqrt(), ln(), or power().

For random strings, you can use something like md5(random()::text).
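Putting those together, a query along these lines (the ranges and the skew transform are arbitrary choices, not prescriptions) produces a uniform integer, a skewed integer, and a random string per row:

-- Sketch: common random-value expressions for dummy data.
SELECT
  floor(random() * 100)::int + 1   AS uniform_1_to_100,  -- uniform integer in [1, 100]
  floor(sqrt(random()) * 100)::int AS skewed_high,       -- biased toward larger values
  md5(random()::text)              AS random_string      -- 32-character hex string
FROM generate_series(1, 10);

Swapping sqrt() for power(random(), 2) would bias the distribution toward smaller values instead.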
