I would like to calculate the new visitors based on all the visitors that visit each day. Right now the only data I have available is the first 2 columns, so I need to extrapolate the last 2 columns based on the first 2 columns.
Date | Visitors_Today | New_Visitors | All_Visitors_To_Date |
---|---|---|---|
Dec 6 | Allie, Jon | Allie, Jon | Allie, Jon |
Dec 7 | Allie, Jon, Zach | Zach | Allie, Jon, Zach |
Dec 8 | Barb, Jon | Barb | Allie, Barb, Jon, Zach |
Dec 9 | Janet, Zach | Janet | Allie, Barb, Janet, Jon, Zach |
This is what I have so far to create the first and second column
WITH visitor_log_response AS (
SELECT
CAST(JSON_PARSE(visitor_log) AS MAP<VARCHAR, VARCHAR>) AS visitor_map,
date
FROM visitor_log_response_table
),
names_and_dates AS (
SELECT DISTINCT
visitor_name AS visitor_name,
date
FROM visitor_log_response
CROSS JOIN UNNEST(visitor_map) AS u(visitor_name, visitor_age)
),
visitor_names AS (
SELECT
date,
ARRAY_JOIN(
ARRAY_AGG(
visitor_name
ORDER BY
visitor_name
),
','
) visitors_today,
FROM names_and_dates
GROUP BY
date
ORDER BY
date DESC
)
SELECT
date,
visitors_today
FROM visitor_names
Which results in this
Date | Visitors_Today |
---|---|
Dec 6 | Allie, Jon |
Dec 7 | Allie, Jon, Zach |
Dec 8 | Barb, Jon |
Dec 9 | Janet, Zach |
If the table is normalized using this query
SELECT ds, visitors_today_split
FROM previous_table
CROSS JOIN UNNEST(SPLIT(visitors_today),',')) as (visitors_today_split)
I would have this output
Date | Visitors_Today |
---|---|
Dec 6 | Allie |
Dec 6 | Jon |
Dec 7 | Allie |
Dec 7 | Jon |
Dec 7 | Zach |
Dec 8 | Barb |
Dec 8 | Jon |
Dec 9 | Janet |
Dec 9 | Zach |
CodePudding user response:
You can flatten your arrays and perform new aggregations in subqueries:
select t.*, (select array_join(array_agg(t1.v), ', ') from
(select v from unnest(t.visitors_today) v
except
select v from tbl t2 cross join unnest(t2.visitors_today) v
where t2.date < t.date) t1),
(select array_join(array_distinct(array_agg(v)), ', ')
from tbl t1 cross join unnest(t1.visitors_today)) v
where t1.date <= t.date)
from tbl t
See fiddle (Demo of query above in Postgres).
CodePudding user response:
You can use window functions with array aggregation (remove ARRAY_JOIN
from visitor_names
CTE):
-- sample data
with dataset(date, visitors_today) as (
values ('Dec 6', array['Allie', 'Jon']),
('Dec 7', array['Allie', 'Jon', 'Zach']),
('Dec 8', array['Barb', 'Jon']),
('Dec 9', array['Janet', 'Zach'])
)
-- query
select date,
visitors_today,
array_distinct(visitors_today || prev_visitors) all_visitors_to_date,
array_except(visitors_today, prev_visitors) new_visitors
from (
select *,
coalesce(
flatten(array_distinct(array_agg(visitors_today)
over (order by date rows between UNBOUNDED PRECEDING and 1 PRECEDING))),
array[]) as prev_visitors -- combine all visitors before today into non null array
from dataset);
Output:
date | visitors_today | all_visitors_to_date | new_visitors |
---|---|---|---|
Dec 6 | [Allie, Jon] | [Allie, Jon] | [Allie, Jon] |
Dec 7 | [Allie, Jon, Zach] | [Allie, Jon, Zach] | [Zach] |
Dec 8 | [Barb, Jon] | [Barb, Jon, Allie, Zach] | [Barb] |
Dec 9 | [Janet, Zach] | [Janet, Zach, Allie, Jon, Barb] | [Janet] |
Note that arrays can be not that optimal type in terms of performance and a limited to 10000 elements in Presto/Trino.