regex to get String between second underscore and last underscore

Time:02-21

I want to get all the characters between the second underscore and the last underscore in a string. Any ideas how this could be accomplished? I will use this regex in the regexp_extract function in Spark.

Examples
Input                                                                                     Output
Problem_ISOAPAPattern_Pat_2nd_byUser-withAllRoles_351107b7-88eb-4232-9107-b788eb92325b    Pat_2nd_byUser-withAllRoles
Problem_ISOACompressionPattern_pattern 7cbc_7cbce13c-0b25-49a4-bce1-3c0b2569a411          pattern 7cbc

// Spark code I am using to extract the value
spark.sql("select regexp_extract(values, '^(?:[^_]+_){2}([^_ ]+)', 1) pname from san2").show(false)

demo URL: https://regex101.com/r/BipSW8/1

Any idea what the regex should look like?

CodePudding user response:

You can use either of

^(?:[^_]+_){2}(.+)_
^(?:[^_]*_){2}([^_]*(?:_[^_]*+)*)_

The second regex is equivalent to

^(?:[^_]*_){2}([^_]*(?>_[^_]*)*)_
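
As a quick sanity check of that equivalence, here is a minimal Java sketch (Spark's regexp_extract is backed by the Java regex engine). The class name, the helper method, and the sample input "a_b_c_d_e_f" are made up for illustration and are not from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class PossessiveVsAtomic {
    // Variant with a possessive quantifier on the inner [^_]* run.
    static final Pattern POSSESSIVE =
        Pattern.compile("^(?:[^_]*_){2}([^_]*(?:_[^_]*+)*)_");
    // Variant with an atomic group around each "_segment" iteration.
    static final Pattern ATOMIC =
        Pattern.compile("^(?:[^_]*_){2}([^_]*(?>_[^_]*)*)_");

    // Returns group 1, or an empty string when there is no match
    // (this mirrors what regexp_extract returns on a non-match).
    static String capture(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String input = "a_b_c_d_e_f"; // made-up sample
        System.out.println(capture(POSSESSIVE, input)); // c_d_e
        System.out.println(capture(ATOMIC, input));     // c_d_e
    }
}
```

Both variants capture the same text; they differ only in how the "no backtracking into a segment" constraint is spelled.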

See regex #1 demo and regex #2 demo. The second one is more efficient than the first, as it does not allow much backtracking.

The first regex uses (.+)_, a greedy dot pattern that first grabs the rest of the line. Backtracking then comes into play: the regex engine steps back along the string, giving up one char at a time, until it finds the rightmost occurrence of _, and then returns the result.

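The second regex matches zero or more chars other than _ (with [^_]*) and then zero or more sequences of _ followed by zero or more chars other than _, as many as possible. The possessive quantifier (or the atomic group in the equivalent form) does not allow backtracking into the [^_]* pattern, so when looking for the final _ the engine can only give back whole _-prefixed segments, not single chars.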
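
To see both patterns on the inputs from the question, here is a small Java sketch. It assumes plain java.util.regex rather than a Spark session, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Spark or of the original answer.
class UnderscoreExtract {
    // Regex #1: greedy dot, backtracks char by char to the last underscore.
    static final Pattern GREEDY =
        Pattern.compile("^(?:[^_]+_){2}(.+)_");
    // Regex #2: unrolled segments with a possessive quantifier.
    static final Pattern UNROLLED =
        Pattern.compile("^(?:[^_]*_){2}([^_]*(?:_[^_]*+)*)_");

    // Returns capture group 1, or "" when the pattern does not match.
    static String extract(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String[] inputs = {
            "Problem_ISOAPAPattern_Pat_2nd_byUser-withAllRoles_351107b7-88eb-4232-9107-b788eb92325b",
            "Problem_ISOACompressionPattern_pattern 7cbc_7cbce13c-0b25-49a4-bce1-3c0b2569a411"
        };
        for (String s : inputs) {
            // Prints "Pat_2nd_byUser-withAllRoles" twice for the first input
            // and "pattern 7cbc" twice for the second.
            System.out.println(extract(GREEDY, s));
            System.out.println(extract(UNROLLED, s));
        }
    }
}
```

Neither pattern contains a backslash, so either should be usable unchanged inside the single-quoted pattern argument of regexp_extract with group index 1, as in the query from the question.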
