Home > Software engineering >  query to remove everything between 1st and 2nd hypen for string in column using presto in hadoop
query to remove everything between 1st and 2nd hypen for string in column using presto in hadoop

Time:09-23

I am trying to retrieve records based on a custom field "ci_ku". For the same values of "ci_ku" we will be having multiple "l1m_visits", and I want to retrieve the minimum value of "l1mvisits" for each "ci_ku". and i want to get the ci_ku by removing the string between 1st and 2nd underscore(-) in ku, how to write a query to get the ci_ku here

Sample Data:

ku item l1m_visits
1234-5678-HIJK 1234 A
1234-9012-HIJK 1234 B
56457-12456-DF-GH-TC 56457 D

Expected Output:

ku ci_ku l1m_visits
1234-5678-HIJK 1234-HIJK A
56457-12456-DF-GH-TC 56457-DF-GH-TC D

Have tried the query below:

WITH tab_with_ci_ku AS (
   select split(ku, '-', 3)ivm_arr,

           l1m_visits,
           last_refresh_date
    FROM db.scema.table
), ranked_visits AS (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY CONCAT(ivm_arr[2],item) as ci_sku ORDER BY l1m_visits) AS rn
    FROM tab_with_ci_ku
)
SELECT sku,ci_ku
FROM ranked_visits
WHERE rn = 1

and facing the following error:

 mismatched input 'ci_ku'. Expecting: 'ALTER', 'ANALYZE', 'CALL', 'COMMENT', 'COMMIT', 'CREATE', 'DEALLOCATE', 'DELETE', 'DENY', 'DESC', 'DESCRIBE', 'DROP', 'EXECUTE', 'EXPLAIN', 'GRANT', 'INSERT', 'MERGE', 'PREPARE', 'REFRESH', 'RESET', 'REVOKE', 'ROLLBACK', 'SET', 'SHOW', 'START', 'TRUNCATE', 'UPDATE', 'USE', <query>

CodePudding user response:

I would use a regex replacement to handle the ci_ku requirement. Note that this replacement should happen first, as the call to ROW_NUMBER depends on it.

WITH tab_with_ci_ku AS (
    SELECT *, REGEXP_REPLACE(ku, '([^-] )-[^-] -(.*)', '$1-$2') AS ci_ku
    FROM hdpsa30.pps.ivm_bands_boomerang_sos
),
ranked_visits AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY ci_ku ORDER BY l1m_visits) AS rn
    FROM tab_with_ci_ku
)

SELECT ku, ci_ku, l1m_visits
FROM ranked_visits
WHERE rn = 1;

Here is an explanation of the regex:

  • ([^-] ) match and capture in $1 the first term
  • - match a literal hyphen
  • [^-] match the second term (to be excluded in the replacement)
  • - match a literal hyphen
  • (.*) match and capture in $2 the remainder of the SKU

Then we replace with $1-$2 to effectively splice out the second term.

  • Related