I have a table like this where some rows have the same grp but different names. I want to group them by name such that all the substrings after removing nonalphanumeric characters are aggregated together and grouped by the largest string. The null value is considered the substring of all the strings.
grp | name | value |
---|---|---|
1 | ab&c | 10 |
1 | abc d e | 56 |
1 | ab | 21 |
1 | a | 23 |
1 | [null] | 1 |
2 | fgh | 87 |
Desired result
grp | name | value |
---|---|---|
1 | abcde | 111 |
2 | fgh | 87 |
My query-
Select grp,
regexp_replace(name,'[^a-zA-Z0-9] ', '', 'g') name, sum(value) value
from table
group by grp,
regexp_replace(name,'[^a-zA-Z0-9] ', '', 'g');
Result
grp | name | value |
---|---|---|
1 | abc | 10 |
1 | abcde | 56 |
1 | ab | 21 |
1 | a | 23 |
1 | [null] | 1 |
2 | fgh | 87 |
What changes should I make in my query?
CodePudding user response:
To solve this problem, I did the following (all of the code below is available on the fiddle here):
CREATE TABLE test
(
grp SMALLINT NOT NULL,
name TEXT NULL,
value SMALLINT NOT NULL
);
and populate it using your data:
INSERT INTO test VALUES
(1, 'ab&c', 10),
(1, 'abc d e', 56),
(1, 'ab', 21),
(1, 'a', 23),
(1, NULL, 1),
(2, 'fgh', 87);
Then, you can:
WITH t1 AS
(
SELECT
grp,
name,
REGEXP_REPLACE(name,'[^a-zA-Z0-9] ', '', 'g') AS n_str,
CASE
WHEN name IS NULL THEN 0
ELSE LENGTH(REGEXP_REPLACE(name,'[^a-zA-Z0-9] ', '', 'g'))
END AS n_len,
value
FROM test
)
SELECT * FROM t1;
Result:
grp name n_str n_len value
1 ab&c abc 3 10
1 abc d e abcde 5 56
1 ab ab 2 21
1 a a 1 23
1 NULL NULL 0 1
2 fgh fgh 3 87
6 rows
Then:
WITH t1 AS
(
SELECT
grp,
name,
REGEXP_REPLACE(name,'[^a-zA-Z0-9] ', '', 'g') AS n_str,
CASE
WHEN name IS NULL THEN 0
ELSE LENGTH(REGEXP_REPLACE(name,'[^a-zA-Z0-9] ', '', 'g'))
END AS n_len,
value
FROM test
),
t2 AS
(
SELECT
t1.grp,
MAX(t1.n_len),
SUM(t1.value)
FROM t1
GROUP BY t1.grp
)
SELECT * FROM t2
ORDER BY grp;
Result:
grp max sum
1 5 111
2 3 87
And finally:
WITH t1 AS
(
SELECT
grp,
name,
REGEXP_REPLACE(name,'[^a-zA-Z0-9] ', '', 'g') AS n_str,
CASE
WHEN name IS NULL THEN 0
ELSE LENGTH(REGEXP_REPLACE(name,'[^a-zA-Z0-9] ', '', 'g'))
END AS n_len,
value
FROM test
),
t2 AS
(
SELECT
t1.grp,
MAX(t1.n_len) AS m_len,
SUM(t1.value) AS s_val
FROM t1
GROUP BY t1.grp
)
SELECT
t2.grp, t2.m_len, t2.s_val, t1.n_str
FROM t2
JOIN t1
ON t2.grp = t1.grp
WHERE t2.m_len = LENGTH(t1.n_str)
ORDER BY grp;
Result:
grp m_len s_val n_str
1 5 111 abcde
2 3 87 fgh
A few points to note:
Please always provide a fiddle when you ask questions such as this one with tables and data - it provides a single source of truth for the question and eliminates duplication of effort on the part of those trying to help you!
You haven't been very clear about what, exactly, should happen with
NULL
s - do thevalues
count towards theSUM()
? You can vary theCASE
statement as required.What happens when there's a tie in the number of characters in the string? I've included an example in the fiddle, where you get the draws - but you may wish to sort alphabetically (or some other method)?
There appears to be an error in your provided sums for the values (even taking account of counting or not values for
NULL
for thename
field).Finally, you don't want to
GROUP BY
the largest string - you want toGROUP BY
thegrp
fields theSUM()
of the values in the the givengrp
records and then pick out the longest alphanumeric string in that grouping. It would be interesting to know why you want to do this?