How can I extract everything in between "_" characters starting with nth occurrence of sai

Time:07-26

How can I extract the parameters below with a regex using RE2 syntax (where not all features are available)?

I need to extract the following (a different regex for each):

  1. The parameter that appears after the third occurrence of _ from the end of the string, up to the second occurrence of _ (from the end of the string)
  2. The parameter that appears after the second occurrence of _ from the end of the string, up to the first occurrence of _ (from the end of the string)
  3. The parameter that appears after the first occurrence of _ from the end of the string, up to the first occurrence of . (from the end of the string)
  4. Everything from the end of the string back to the first occurrence of . (that is, the text after the last dot)

Let's say we have the string below:

  • Parameter1_Parameter3_Parameter4_ParameterA_ParameterB_ParameterC.mp4

In this case I want to extract:

  1. ParameterA
  2. ParameterB
  3. ParameterC
  4. mp4

A couple of notes:

  • There can be many additional strings between "_" characters at the start of the string
  • ParameterA, ParameterB, ParameterC and the "." are always in the same position (counting from the end of the string)
  • The extension (mp4 here) can be of any length; it is not fixed at 3 characters

The best I have managed so far is to extract _ParameterA_ParameterB_ParameterC.mp4 with ((?:_[^_]*){3})$, but that's not what I need.

I also figured out how to pull ".mp4" with (\..*)$, but I can't figure out how to get it without the leading dot.


I figured out how to pull "mp4" with RE2. It's ([^.]*)$.

CodePudding user response:

Here are four regular expressions that use lookarounds (a lookbehind plus a lookahead). Let me know if they work:

"(?<=_)[^_]+(?=_[^_]+_[^_]+\.)"
"(?<=_)[^_]+(?=_[^_]+\.)"
"(?<=_)[^_]+(?=\.)"
"(?<=\.).*$"

As Google Data Studio does not support lookarounds, here is an alternative multi-step workaround. It is written in R (using the stringr package) but can be translated to your language of choice:

library(stringr)

text1 <- "Parameter1_Parameter3_Parameter4_ParameterA_ParameterB_ParameterC.mp4"

# Isolate the last three parameters plus the extension
last_three <- str_extract(text1, "[^_]+_[^_]+_[^_]+\\..+")

# 1. ParameterA
str_extract(last_three, "^[^_]+")

# 2. ParameterB
str_replace_all(str_extract(last_three, "_[^_\\.]+_"), "_", "")

# 3. ParameterC
str_replace(str_extract(last_three, "[^_\\.]+\\."), "\\.", "")

# 4. mp4
str_replace(str_extract(last_three, "\\..+$"), "\\.", "")
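
For readers who do not use R, the same multi-step idea can be sketched in Python. This is only an illustrative translation: Python's `re` module is not RE2, but the patterns below avoid lookarounds, and the sample string uses the spellings from the expected output in the question.

```python
import re

name = "Parameter1_Parameter3_Parameter4_ParameterA_ParameterB_ParameterC.mp4"

# Step 1: isolate the last three parameters plus the extension.
last_three = re.search(r"[^_]+_[^_]+_[^_]+\..+", name).group(0)

# Step 2: extract each piece, then strip the delimiters that were matched with it.
param_a = re.search(r"^[^_]+", last_three).group(0)
param_b = re.search(r"_[^_.]+_", last_three).group(0).replace("_", "")
param_c = re.search(r"[^_.]+\.", last_three).group(0).replace(".", "")
extension = re.search(r"\..+$", last_three).group(0).replace(".", "", 1)

print(param_a, param_b, param_c, extension)
```

Each step matches a little more than is needed (an underscore or a dot) and removes it afterwards, which is what the extract-then-replace pairs in the R version do.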

Google Data Studio has the functions needed for this, REGEXP_EXTRACT and REGEXP_REPLACE:

https://support.google.com/datastudio/table/6379764?hl=en
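
A single-step alternative is to anchor each pattern at the end of the string and wrap only the wanted part in a capture group. The sketch below assumes that REGEXP_EXTRACT returns the contents of the first capture group when the pattern contains one. The helper `regexp_extract` is a hypothetical Python stand-in for that behavior, not the Data Studio function itself, and Python's `re` is used only because these patterns stay within syntax RE2 also accepts (no lookarounds, no backreferences).

```python
import re


def regexp_extract(text, pattern):
    """Hypothetical stand-in: return capture group 1 of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


name = "Parameter1_Parameter3_Parameter4_ParameterA_ParameterB_ParameterC.mp4"

# The trailing "_[^_]*" pieces skip the tokens that follow the wanted one;
# "$" forces the count to start from the end of the string.
third_from_end = regexp_extract(name, r"_([^_]*)_[^_]*_[^_]*$")
second_from_end = regexp_extract(name, r"_([^_]*)_[^_]*$")
last_before_dot = regexp_extract(name, r"_([^_]*)\.[^.]*$")
extension = regexp_extract(name, r"([^.]*)$")

print(third_from_end, second_from_end, last_before_dot, extension)
```

Because every pattern ends in `$`, extra tokens at the start of the string do not change the results.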

CodePudding user response:

You can use a single capture group to match the first parameter and always capture the extension at the end of the string.

For the second through nth parameters, use optional capture groups.

If you don't want to cross newlines, you could change the character class to [^_\r\n]*

_([^_]*)(?:_([^_]*)(?:_([^_]*))?)?\.(\w+)$
  • _([^_]*) Match _ and capture 0 or more chars other than _ in group 1
  • (?: Non-capture group
    • _([^_]*) Match _ and capture 0 or more chars other than _ in group 2
    • (?: Non-capture group
      • _([^_]*) Match _ and capture 0 or more chars other than _ in group 3
    • )? Close the inner non-capture group and make it optional
  • )? Close the outer non-capture group and make it optional
  • \.(\w+) Match a dot and capture 1 or more word chars in group 4
  • $ End of string
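
To see which group receives which value, the pattern can be tried in any engine that supports plain capture groups. The sketch below uses Python's `re` purely for illustration (it is not RE2, but the pattern uses no features RE2 lacks). The second file name, `A_B.mp4`, is a made-up example showing what happens when fewer tokens are present.

```python
import re

pattern = re.compile(r"_([^_]*)(?:_([^_]*)(?:_([^_]*))?)?\.(\w+)$")

name = "Parameter1_Parameter3_Parameter4_ParameterA_ParameterB_ParameterC.mp4"

# The search finds the leftmost "_" from which the rest of the pattern can
# still reach the end of the string, so at most three tokens are captured.
groups = pattern.search(name).groups()
print(groups)

# With only one "_" in the name, the optional groups stay unset (None).
short_groups = pattern.search("A_B.mp4").groups()
print(short_groups)
```

With three or more underscores in the name, groups 1 to 3 hold the last three tokens and group 4 holds the extension. With fewer underscores, the tokens that are present fill the groups from group 1 onward.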
