query to remove everything between 1st and 2nd hypen for string in column using presto in hadoop-CodePudding

I am trying to retrieve records based on a custom field "ci_ku". For the same values of "ci_ku" we will be having multiple "l1m_visits", and I want to retrieve the minimum value of "l1mvisits" for each "ci_ku". and i want to get the ci_ku by removing the string between 1st and 2nd underscore(-) in ku, how to write a query to get the ci_ku here

Sample Data:

ku	item	l1m_visits
1234-5678-HIJK	1234	A
1234-9012-HIJK	1234	B
56457-12456-DF-GH-TC	56457	D

Expected Output:

ku	ci_ku	l1m_visits
1234-5678-HIJK	1234-HIJK	A
56457-12456-DF-GH-TC	56457-DF-GH-TC	D

Have tried the query below:

WITH tab_with_ci_ku AS (
   select split(ku, '-', 3)ivm_arr,

           l1m_visits,
           last_refresh_date
    FROM db.scema.table
), ranked_visits AS (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY CONCAT(ivm_arr[2],item) as ci_sku ORDER BY l1m_visits) AS rn
    FROM tab_with_ci_ku
)
SELECT sku,ci_ku
FROM ranked_visits
WHERE rn = 1

and facing the following error:

 mismatched input 'ci_ku'. Expecting: 'ALTER', 'ANALYZE', 'CALL', 'COMMENT', 'COMMIT', 'CREATE', 'DEALLOCATE', 'DELETE', 'DENY', 'DESC', 'DESCRIBE', 'DROP', 'EXECUTE', 'EXPLAIN', 'GRANT', 'INSERT', 'MERGE', 'PREPARE', 'REFRESH', 'RESET', 'REVOKE', 'ROLLBACK', 'SET', 'SHOW', 'START', 'TRUNCATE', 'UPDATE', 'USE', <query>

CodePudding user response：

I would use a regex replacement to handle the ci_ku requirement. Note that this replacement should happen first, as the call to ROW_NUMBER depends on it.

WITH tab_with_ci_ku AS (
    SELECT *, REGEXP_REPLACE(ku, '([^-] )-[^-] -(.*)', '$1-$2') AS ci_ku
    FROM hdpsa30.pps.ivm_bands_boomerang_sos
),
ranked_visits AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY ci_ku ORDER BY l1m_visits) AS rn
    FROM tab_with_ci_ku
)

SELECT ku, ci_ku, l1m_visits
FROM ranked_visits
WHERE rn = 1;

Here is an explanation of the regex:

([^-] ) match and capture in $1 the first term
- match a literal hyphen
[^-] match the second term (to be excluded in the replacement)
- match a literal hyphen
(.*) match and capture in $2 the remainder of the SKU

Then we replace with $1-$2 to effectively splice out the second term.