I have been using the row_number()
function to only select the observations that I need.
In my scenario, whenever there is two different name for a particular <id, entity_id, period, element>
, the National one should be left-out. In case there is only one, take the only one.
---- ----------- -------- --------------- ---------------------------
| id | entity_id | period | element | name |
---- ----------- -------- --------------- ---------------------------
| 12 | ABC123 | 2021 | Overall value | National Compatible - XYZ |
| 12 | ABC123 | 2021 | Overall value | Overall Estimation |
---- ----------- -------- --------------- ---------------------------
With cases like above, the following did the trick:
SELECT *
FROM (SELECT *,
Row_number()
OVER (
partition BY id, entity_id, period, element
ORDER BY NAME DESC) AS rn
FROM mydata) table
WHERE table.rn = 1
Problem is that now there are other cases like the following:
---- ----------- -------- --------------- ---------------------------
| id | entity_id | period | element | name |
---- ----------- -------- --------------- ---------------------------
| 12 | ABC123 | 2021 | Overall value | National Based - ZYX |
| 12 | ABC123 | 2021 | Overall value | Base Estimation |
---- ----------- -------- --------------- ---------------------------
And with the current SQL this would not work as I would have to change the order by from descending to ascending.
Is there any possibility to de-prioritize the "National..." record and take the other one in case there are multiple ones?
I am running the query on Hive/Impala.
CodePudding user response:
If you add another derived-table layer (or use a CTE) then you can add a CASE WHEN
to check for "name"
starting with 'National'
and give it a simple integer "tag" value you can use to de-prioritize those rows.
...like so:
WITH q AS (
SELECT
"id",
"entity_id",
"period",
"element",
"name",
CASE WHEN "name" LIKE 'National%' THEN 1 ELSE 2 END AS "tag"
FROM
mydata
),
filtered AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY
"id", "entity_id", "period", "element"
ORDER BY
"tag" DESC,
"name" DESC
) AS rn
FROM
q
)
SELECT
*
FROM
filtered
WHERE
rn = 1