Home > Software design >  GROUP by Largest String for all the substrings
GROUP by Largest String for all the substrings

Time:07-24

I have a table like this where some rows have the same grp but different names. I want to group them by name such that all the substrings after removing nonalphanumeric characters are aggregated together and grouped by the largest string. The null value is considered the substring of all the strings.

grp name value
1 ab&c 10
1 abc d e 56
1 ab 21
1 a 23
1 [null] 1
2 fgh 87

Desired result

grp name value
1 abcde 111
2 fgh 87

My query-

Select grp, 
regexp_replace(name,'[^a-zA-Z0-9] ', '', 'g') name, sum(value) value
from table
group by grp,
regexp_replace(name,'[^a-zA-Z0-9] ', '', 'g');

Result

grp name value
1 abc 10
1 abcde 56
1 ab 21
1 a 23
1 [null] 1
2 fgh 87

What changes should I make in my query?

CodePudding user response:

To solve this problem, I did the following (all of the code below is available on the fiddle here):

CREATE TABLE test 
(
  grp  SMALLINT NOT NULL,
  name TEXT         NULL,
  value SMALLINT NOT NULL
);

and populate it using your data:

INSERT INTO test VALUES
(1, 'ab&c',     10),
(1, 'abc d e',  56),
(1, 'ab',   21),
(1, 'a',        23),
(1, NULL,        1),
(2, 'fgh',      87);

Then, you can:

WITH t1 AS 
(
  SELECT
    grp,
    name,
    REGEXP_REPLACE(name,'[^a-zA-Z0-9] ', '', 'g') AS n_str,
    CASE
      WHEN name IS NULL THEN 0
      ELSE LENGTH(REGEXP_REPLACE(name,'[^a-zA-Z0-9] ', '', 'g'))
    END AS n_len,
    value
  FROM test
)
SELECT * FROM t1;

Result:

grp    name     n_str   n_len   value
  1    ab&c       abc       3      10
  1    abc d e  abcde       5      56
  1    ab          ab       2      21
  1    a            a       1      23
  1    NULL      NULL       0       1
  2    fgh        fgh       3      87
6 rows

Then:

WITH t1 AS 
(
  SELECT
    grp,
    name,
    REGEXP_REPLACE(name,'[^a-zA-Z0-9] ', '', 'g') AS n_str,
    CASE
      WHEN name IS NULL THEN 0
      ELSE LENGTH(REGEXP_REPLACE(name,'[^a-zA-Z0-9] ', '', 'g'))
    END AS n_len,
    value
  FROM test
),
t2 AS
(
  SELECT 
    t1.grp, 
    MAX(t1.n_len), 
    SUM(t1.value)
  FROM t1
  GROUP BY t1.grp
)
SELECT * FROM t2
ORDER BY grp;

Result:

grp     max     sum
  1       5     111
  2       3      87

And finally:

WITH t1 AS 
(
  SELECT
    grp,
    name,
    REGEXP_REPLACE(name,'[^a-zA-Z0-9] ', '', 'g') AS n_str,
    CASE
      WHEN name IS NULL THEN 0
      ELSE LENGTH(REGEXP_REPLACE(name,'[^a-zA-Z0-9] ', '', 'g'))
    END AS n_len,
    value
  FROM test
),
t2 AS
(
  SELECT 
    t1.grp, 
    MAX(t1.n_len) AS m_len, 
    SUM(t1.value) AS s_val
  FROM t1
  GROUP BY t1.grp
)
SELECT
  t2.grp, t2.m_len, t2.s_val, t1.n_str
FROM t2
JOIN t1
  ON t2.grp = t1.grp
WHERE t2.m_len = LENGTH(t1.n_str)
ORDER BY grp;

Result:

grp     m_len   s_val   n_str
  1         5     111   abcde
  2         3      87   fgh

A few points to note:

  • Please always provide a fiddle when you ask questions such as this one with tables and data - it provides a single source of truth for the question and eliminates duplication of effort on the part of those trying to help you!

  • You haven't been very clear about what, exactly, should happen with NULLs - do the values count towards the SUM()? You can vary the CASE statement as required.

  • What happens when there's a tie in the number of characters in the string? I've included an example in the fiddle, where you get the draws - but you may wish to sort alphabetically (or some other method)?

  • There appears to be an error in your provided sums for the values (even taking account of counting or not values for NULL for the name field).

  • Finally, you don't want to GROUP BY the largest string - you want to GROUP BY the grp fields the SUM() of the values in the the given grp records and then pick out the longest alphanumeric string in that grouping. It would be interesting to know why you want to do this?

  • Related