How to use regexp_replace() with GROUP BY clause in presto query-CodePudding

I am trying to retrieve records based on a custom field "ci_ku". For the same values of "ci_ku" we will be having multiple "l1m_visits", and I want to retrieve the minimum value of "l1mvisits" for each "ci_ku".

Sample Data:

ku	ci_ku	l1m_visits
1234-5678-HIJK	1234-HIJK	A
1234-9012-HIJK	1234-HIJK	B

Expected Output:

ku	ci_ku	l1m_visits
1234-5678-HIJK	1234-HIJK	A

Have tried the query below:

SELECT DISTINCT REGEXP_REPLACE(ku, CONCAT('-',CAST(v_nbr AS varchar)), '') AS ci_ku,
                ku,
                MIN(l1m_visits),
                last_refresh_date 
FROM db.schema.table 
GROUP BY ci_ku;

and facing the following error:

line 1:194: Column 'ci_ku' cannot be resolved

CodePudding user response：

That error is fired because the field "ci_ku" is not yet generated when the GROUP BY clause is evaluated. Further there are some more issues in your query:

not all non-aggregated rows are found within the GROUP BY clause ("ku" and "last_refresh_date" should be included)
the DISTINCT keyword will remove duplicate rows, though there are none after your SELECT statement.

Instead of using aggregation, the ROW_NUMBER window function may get your result faster. It will generate an incremental number for each of your "ci_ku" values (PARTITION BY ci_ku) and ordered by "l1m_visits" (ORDER BY ci_ku), such that your row number equal to 1 will represent the lowest "l1m_visits" for each "ci_ku".

WITH tab_with_ci_ku AS (
    SELECT REGEXP_REPLACE(ku, CONCAT('-',CAST(v_nbr AS varchar)), '') AS ci_ku,
           ku,
           l1m_visits,
           last_refresh_date
    FROM db.schema.table 
), ranked_visits AS (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY ci_ku ORDER BY l1m_visits) AS rn
    FROM tab_with_ci_ku
)
SELECT ku,
       ci_ku,
       l1m_visits
FROM ranked_visits
WHERE rn = 1

If you're using PostgreSQL, you can also use the FETCH n ROWS WITH TIES clause that retrieves the first row for each tied row number (it will pick the each row number = 1):

WITH tab_with_ci_ku AS (
    SELECT REGEXP_REPLACE(ku, CONCAT('-',CAST(v_nbr AS varchar)), '') AS ci_ku,
           ku,
           l1m_visits,
           last_refresh_date
    FROM db.schema.table 
)
SELECT ku,
       ci_ku,
       l1m_visits
FROM ranked_visits
ORDER BY ROW_NUMBER() OVER(PARTITION BY ci_ku ORDER BY l1m_visits)
FETCH FIRST 1 ROWS WITH TIES;