How can I extract the parameters below with a regex using RE2 syntax (not all features are available)?
What I need to extract (a different regex for each):
- The parameter that appears after the third occurrence of _ from the end of the string and up to the second occurrence of _ (from the end of the string)
- The parameter that appears after the second occurrence of _ from the end of the string and up to the first occurrence of _ (from the end of the string)
- The parameter that appears after the first occurrence of _ from the end of the string and up to the first occurrence of . (from the end of the string)
- Everything from the end of the string back to the first occurrence of . (counting from the end of the string)
Let's say that we have the string below:
- Parameter1_Parameter3_Parameter4_ParameterA_ParameterB_ParameterC.mp4
In this case I want to extract:
- ParameterA
- ParameterB
- ParameterC
- mp4
A couple of notes:
- There can be many additional strings between "_" characters at the start of the string
- ParameterA, ParameterB, ParameterC and the "." are always in the same position (counting from the end of the string)
- The extension (mp4 here) can be of any length; it's not fixed to 3 characters
The best I have managed so far is to extract _ParameterA_ParameterB_ParameterC.mp4 with ((?:_[^_]*){3})$ but that's not what I need.
I also figured out how to pull ".mp4" with (\..*)$ but can't figure out how to get it without the leading dot.
I figured out how to pull "mp4" with RE2. It's ([^.]*)$
CodePudding user response:
Here are four suitable regular expressions using positive lookarounds. Let me know if they work:
"(?<=\_)[^_]+(?=_[^_]+_[^_]+\.)"
"(?<=\_)[^_]+(?=_[^_]+\.)"
"(?<=\_)[^_]+(?=\.)"
"(?<=\.).*$"
As Google Data Studio does not support lookarounds, here is an alternative workaround in multiple steps, which is written in R but can be translated to your language of choice:
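library(stringr)
text1 <- "Parameter1_Parameter3_Parameter4_ParameterA_ParameterB_ParameterC.mp4"
# Keep only the last three fields plus the extension
last_three <- str_extract(text1, "[^_]+_[^_]+_[^_]+\\..+")
str_extract(last_three, "^[^_]+")                                    # third field from the end
str_replace_all(str_extract(last_three, "_[^_\\.]+_"), "_", "")      # second field from the end
str_replace(str_extract(last_three, "[^_\\.]+\\."), "\\.", "")       # last field before the dot
str_replace(str_extract(last_three, "\\..+$"), "\\.", "")            # extension without the dot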
Google Data Studio has the functions required for this, REGEXP_EXTRACT and REGEXP_REPLACE:
https://support.google.com/datastudio/table/6379764?hl=en
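As a rough illustration of how the multi-step R workaround above might be translated, here is a minimal Python sketch. It assumes the sample file name from the question, and the variable names (last_three, param_a and so on) are only illustrative. Python's re module is not RE2, but the patterns below avoid lookarounds, so they should also be acceptable to an RE2-based engine.

```python
import re

# Sample file name from the question (assumed input for this sketch).
text = "Parameter1_Parameter3_Parameter4_ParameterA_ParameterB_ParameterC.mp4"

# Step 1: isolate the last three underscore-separated fields plus the extension.
last_three = re.search(r"[^_]+_[^_]+_[^_]+\..+", text).group(0)

# Step 2: extract each piece from the shorter string, then strip the delimiters.
param_a = re.search(r"^[^_]+", last_three).group(0)
param_b = re.search(r"_[^_.]+_", last_three).group(0).replace("_", "")
param_c = re.search(r"[^_.]+\.", last_three).group(0).replace(".", "")
extension = re.search(r"\..+$", last_three).group(0).replace(".", "", 1)

print(param_a, param_b, param_c, extension)
```

The extract-then-replace pairing mirrors what REGEXP_EXTRACT followed by REGEXP_REPLACE would do; the exact Data Studio formula is left out here because it depends on your field names.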
CodePudding user response:
You can use a single capture group to match the first parameter and always capture the extension at the end of the string.
For the second through nth parameters, use nested optional groups, each containing a capture group.
If you don't want to cross newlines, you could change the character class to [^_\r\n]*
_([^_]*)(?:_([^_]*)(?:_([^_]*))?)?\.(\w+)$
- _([^_]*) Match _ and capture 0+ times any char except _ in group 1
- (?: Non capture group
  - _([^_]*) Match _ and capture 0+ times any char except _ in group 2
  - (?: Non capture group
    - _([^_]*) Match _ and capture 0+ times any char except _ in group 3
  - )? Close non capture group and make it optional
- )? Close the whole non capture group and make it optional
- \.(\w+) Match a dot and capture 1+ word chars in group 4
- $ End of string
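To see how the groups fill in, here is a small Python sketch that applies the pattern above to the sample string from the question. The second file name (x_OnlyOne.mov) is a made-up input, added only to show the optional groups coming back empty. Python's re is used for convenience; the pattern has no lookarounds or backreferences, so it should behave the same way under RE2.

```python
import re

# Pattern from the answer: one required field, two optional ones, then the extension.
pattern = re.compile(r"_([^_]*)(?:_([^_]*)(?:_([^_]*))?)?\.(\w+)$")

# Sample file name from the question (assumed input for this sketch).
text = "Parameter1_Parameter3_Parameter4_ParameterA_ParameterB_ParameterC.mp4"
groups = pattern.search(text).groups()
print(groups)  # ('ParameterA', 'ParameterB', 'ParameterC', 'mp4')

# Hypothetical shorter name: only one field after an underscore, so groups 2 and 3 are None.
short = pattern.search("x_OnlyOne.mov").groups()
print(short)  # ('OnlyOne', None, None, 'mov')
```

Because the search is unanchored on the left, the engine tries each underscore in turn and the first one that lets the rest of the pattern reach the end of the string wins, which is why the three captured fields are the last three.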