how to remove duplicated records depend on column not able to sort in hive-CodePudding

I have :

table test contain :

unique_id string , file_name string , mount bigint

sample of date :

uniqu_id , file_name             , mount 
1        , test.txt              , 15
1        , test_R_file.txt       , 50
3        , test_567.txt          , 30
3        , test_567_R_file.txt   , 100

what I want to do :

I need query to insert overwrite the table where I need to keep for each duplicated uniqu_id one record and this record should be the ones that has (R in the file name column)

the issue :

test table is extrnal table in hive (that mean it not support update and delete operation ) so I want insert overwrite to remove duplicated records for each uniqu_id in the table (in case I have 2 records for the same unique_id only the record that has (R) in file name record should stay ) , I was think to use ranking but the idea I do not have column to order on it to knew what record should I keep and what record should I remove I just has the file_name column who should I check it in case I have 2 record has the same unique_id to knew which record should I keep and which should I remove

CodePudding user response：

You can sort by boolean expression does R exists in the filename or not using CASE expression. Also you can convert boolean to int in CASE and add more conditions to the CASE as well as add more orderby expressions, comma separated. You can sort by boolean because True is greater than False.

Demo:

with mytable as (--demo dataset, use your table instead of this CTE
select 1 unique_id , 'test.txt' file_name , 15 mount union all
select 1        , 'test_R_file.txt'       , 50 union all
select 3        , 'test_567.txt'          , 30 union all
select 3        , 'test_567_R_file.txt'   , 100
)

select unique_id, file_name, mount
from 
(
select unique_id, file_name, mount, 
       row_number() over(partition by unique_id
                             order by file_name rlike '_R_' desc --True is greater than False
                                      --order by something else if necessary
                        ) rn
       from mytable t
) s
where rn=1

Result:

unique_id   file_name            mount
1           test_R_file.txt      50
3           test_567_R_file.txt  100

Use rank instead of row_number if there are possible multiple records with R per unique_id and you want to keep all of them. Rank will assign 1 to all records with R, row_number will assign 1 to the only such record per unique_id.