How to sort values in arrays, aggregate it, efficiently way to count the similar records in SQL-CodePudding

I need to count and aggregate by types properties from the table.

These are some examples of "Title" column:

Row	Title
1	WATLINGTON STATION - 9 HOUSES AND INDUSTRIAL UNITS
2	STATION ROAD YATE - 1 BUNGALOW & 2 HOUSES
3	THE OLD VICARAGE CARR LANE - HOUSE & STABLE BLOCK

So, then to clean these rows I've used REPLACE() AND REGEX() functions:

SELECT SPLIT(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(TRIM(REGEXP_REPLACE(REGEXP_SUBSTR(REPLACE("Title",'-2',''), '[^-]*$'), '[0-9] ')),'/', '&'), ' &  ','&'),' AND ','&'),' & ','&'),'& ','&'),'.',''),'&') AS "Category",
COUNT("Category") AS "Frequency"
FROM DATA.PROPERTIES
GROUP BY "Category" 
ORDER BY "Frequency" DESC;

And I got rows like:

Category	Frequency
"HOUSES"	12432
"FLATS"	76736
"HOUSE"	23865
"APARTMENTS"	18814
"HOUSES", "FLATS"	3245
"FLATS", "HOUSES"	1342
"FLATS", "RETAIL UNIT"	362
"FLATS", "HOUSE"	461
"FLATS", "RETAIL"	890
"FLATS", "RETAIL UNITS"	287

I've gave some results. As you can see [ "HOUSES", "FLATS" ] and [ "FLATS", "HOUSES" ] should be joined into one, cause it similar and it should be like [ "HOUSES", "FLATS" ] and 4587.

So, I assume that somehow I should re-order every array in ASC or DESC order and aggregate it. And similar question about three rows like [ "FLATS", "RETAIL UNIT" ], [ "FLATS", "RETAIL" ], [ "FLATS", "RETAIL UNITS" ].

Do you have any ideas how to resolve it??

May be the way using ARRAY_AGG or something else. I have no any ideas now, but would like to resolve this question.

CodePudding user response：

Example showing reorder and aggregate - Idea is to first flatten the array and then reorder them while aggregating. Finally once they are in order simple grouping can be used, which will group all arrays that were previously like ['a','b'], ['b','a'] under one group.

with cte (category, frequency) as (
    select array_construct('HOUSES'),12432 union all
    select array_construct('FLATS'),76736 union all
    select array_construct('HOUSE'),23865 union all
    select array_construct('APARTMENTS'),18814 union all
    select array_construct('HOUSES', 'FLATS'),3245 union all
    select array_construct('FLATS', 'HOUSES'),1342 union all
    select array_construct('FLATS', 'RETAIL UNIT'),362 union all
    select array_construct('FLATS', 'HOUSE'),461 union all
    select array_construct('FLATS', 'RETAIL'),890 union all
    select array_construct('FLATS', 'RETAIL UNITS'),287
), cte_1 as
    (select seq,value,c1.frequency
        from cte c1,
        lateral flatten(input=>category)
        order by seq, value
    ), cte_2 as
        (select array_agg(value) category, frequency
            from cte_1
            group by seq,frequency)
    select category, sum(frequency)
        from cte_2
        group by category;

CATEGORY	SUM(FREQUENCY)
[ "FLATS", "HOUSE" ]	461
[ "FLATS", "RETAIL" ]	890
[ "HOUSE" ]	23865
[ "FLATS" ]	76736
[ "FLATS", "RETAIL UNITS" ]	287
[ "APARTMENTS" ]	18814
[ "FLATS", "RETAIL UNIT" ]	362
[ "HOUSES" ]	12432
[ "FLATS", "HOUSES" ]	4587

CodePudding user response：

I really like Panka's answer, but the ORDER BY should not be in the step of CTE and should be inside the ARRRAY_AGG because a) there is a command to solve this b) the order of CTE is unimportant, and relying on order of rows is a magic that will burn you.

Also the CTE2 and CTE can be merged into one step, if you alias the FLATEN you can see what is happening:

with fake_data_table (category, frequency) as (
    select split(column1, '|'), column2 
    from values 
        ('HOUSES', 12432),
        ('FLATS', 76736),
        ('HOUSE', 23865),
        ('APARTMENTS', 18814),
        ('HOUSES|FLATS', 3245),
        ('FLATS|HOUSES', 1342),
        ('FLATS|RETAIL UNIT', 362),
        ('FLATS|HOUSE', 461),
        ('FLATS|RETAIL', 890),
        ('FLATS|RETAIL UNITS', 287)
), part_a as (
    select array_agg(f.value) within group (order by f.seq, f.value) as category, 
        d.frequency
    from fake_data_table as d,
    lateral flatten(input=>category) f
    group by f.seq, d.frequency
)
select category, sum(frequency)
from part_a
group by category 
order by 1;

gives:

CATEGORY	SUM(FREQUENCY)
[ "APARTMENTS" ]	18,814
[ "FLATS" ]	76,736
[ "FLATS", "HOUSE" ]	461
[ "FLATS", "HOUSES" ]	4,587
[ "FLATS", "RETAIL" ]	890
[ "FLATS", "RETAIL UNIT" ]	362
[ "FLATS", "RETAIL UNITS" ]	287
[ "HOUSE" ]	23,865
[ "HOUSES" ]	12,432