HIVE - How to extract hashtags from a string in HIVE-CodePudding

IF we have a sample field say 'note' note could be in ( 'xyz #abc-xyz-dfg. #abd' , '#1236&89'

, '## avc #1 "Changed xyz #2 from "string"\n - stop #3 #'

)

What I am doing is -REGEXP_EXTRACT( regexp_replace( upper(note),'.?(\\#\\w )' ' $1' ), '.(\\#\\w )', 0 )

But what it is doing is:

if we have a tag as #abc-xyz it's only fetching #abc instead of #abc-xyz

or a tag as #abc-xyz. #abcd it's only fetching #abcd instead of #abc-xyz #abcd

Or if we have #1 or #2, it's skipping it, giving nulls

or if we have #1236&89 , it's giving #123, we need complete

Edit: Solution :

REGEXP_EXTRACT(
      regexp_replace(
        upper(note),
        '.*?(#[\w?\d?\\`\\~!@#$%\\^&*-_ {}\\[\\]\\|//?;:\\"] )',
        ' $1'
      ),
      '.*(#[\w?\d?\\`\\~!@#$%\\^&*-_ {}\\[\\]\\|//?;:\\"] )',
      0
    )

CodePudding user response：

\w is equivalent to [a-zA-Z0-9_], which doesn't contain -, %, or &. Instead of using \w, you should define a character set that contains everything that you want to match, such as [a-zA-Z0-9_%&-].

Note that - must be included as either the first or last character in the set in order for it to treated as a literal. When it exists between two characters it defines a range.