Weighted Random Selection-CodePudding

Please. I have two tables with the most common first and last names. Each table has basically two fields:

Tables

CREATE TABLE "common_first_name" (
    "first_name" text PRIMARY KEY, --The text representing the name
    "ratio" numeric NOT NULL, -- the % of how many times it occurs compared to the other names.     
    "inserted_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL,
    "updated_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL
);

CREATE TABLE "common_last_name" (
    "last_name" text PRIMARY KEY, --The text representing the name
    "ratio" numeric NOT NULL, -- the % of how many times it occurs compared to the other names.     
    "inserted_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL,
    "updated_at" timestamp WITH time zone DEFAULT timezone('utc'::text, now()) NOT NULL
);

P.S: The TOP 1 name occurs only ~ 1.8% of the time. The tables have 1000 rows each.

Function (Pseudo, not READY)

CREATE OR REPLACE FUNCTION create_sample_data(p_number_of_records INT)
    RETURNS VOID
    AS $$
DECLARE
    SUM_OF_WEIGHTS CONSTANT INT := 100;
BEGIN

    FOR i IN 1..coalesce(p_number_of_records, 0) LOOP
      --Get the random first and last name but taking in consideration their probability (RATIO)round(random()*SUM_OF_WEIGHTS); 
      --create_person (random_first_name || ' ' || random_last_name);
    END LOOP;
END
$$
LANGUAGE plpgsql VOLATILE;

P.S.: The sum of all ratios for each name (per table) sums up to 100%.

I want to run a function N times and get a name and a surname to create sample data... both tables have 1000 rows each.

The sample size can be anywhere from 1000 full names to 1000000 names, so if there is a "fast" way of doing this random weighted function, even better.

Any suggestion of how to do it in PL/PGSQL?

I am using PG 13.3 on SUPABASE.IO.

Thanks

CodePudding user response：

Given the small input dataset, it's straightforward to do this in pure SQL. Use CTEs to build lower & upper bound columns for each row in each of the common_FOO_name tables, then use generate_series() to generate sets of random numbers. Join everything together, and use the random value between the bounds as the WHERE clause.

with first_names_weighted as (
  select first_name,
    sum(ratio) over (order by first_name) - ratio as lower_bound,
    sum(ratio) over (order by first_name) as upper_bound
  from common_first_name
),
last_names_weighted as (
  select last_name,
    sum(ratio) over (order by last_name) - ratio as lower_bound,
    sum(ratio) over (order by last_name) as upper_bound
  from common_last_name
),
randoms as (
  select random() * (select sum(ratio) from common_first_name) as f_random,
         random() * (select sum(ratio) from common_last_name) as l_random
  from generate_series(1, 32)
)
select r, first_name, last_name
from randoms r
cross join first_names_weighted f
cross join last_names_weighted l
where f.lower_bound <= r.f_random and r.f_random <= f.upper_bound
  and l.lower_bound <= r.l_random and r.l_random <= l.upper_bound;

Change the value passed to generate_series() to control how many names to generate. If it's important that it be a function, you can just use a LANGAUGE SQL function definition to parameterize that number:

https://www.db-fiddle.com/f/mmGQRhCP2W1yfhZTm1yXu5/3