Hi, I'm very new to regex and am struggling to make this work.
I have a variable in Data Studio that contains the page URLs. I need to keep only a subset of pages in a table - those that contain 4 or more backslashes:
For example, these URLs should be in the subset:
- /abc/state/region/place1
- /abc/state/region/place1/details
- /abc/territory/region/place2
- /abc/state/region/place3/details/more-specific
Whereas these URLs should be excluded:
- /abc/state/region
- /abc
- /abc/xyz-page
I thought something like \/{4,} would work, but it doesn't seem to return any results.
CodePudding user response:
First, those are slashes; backslashes go the other way (\). It's important because backslashes are used in all sorts of syntactic ways in programming languages, including regular expressions: putting a backslash in front of a special character changes its meaning between whatever special function it has and just matching itself normally. In regexes, / is not special, so you don't have to put a backslash in front of it (unless you're using a language where the regex itself is delimited by /s on the outside of the whole thing, but since you're using Google Data Studio I'm pretty sure regexes are entered as plain strings).
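As a rough illustration of the escaping point, here is a small sketch in Python. Python's re module is only a stand-in for the RE2 engine that Data Studio uses (an assumption made for demonstration), and the sample strings are made up. It suggests that the backslash in front of the slash makes no difference, so it was not the reason the original pattern returned nothing:

```python
import re

# Python's re module is used here only as a stand-in for Data Studio's RE2 engine.
escaped = re.compile(r"\/{4,}")    # slash with an unnecessary backslash
unescaped = re.compile(r"/{4,}")   # plain slash

# Hypothetical sample strings chosen for illustration.
samples = ["////", "/abc/state/region/place1", "a/////b"]

for s in samples:
    # Print whether each pattern is found somewhere in the sample.
    print(s, bool(escaped.search(s)), bool(unescaped.search(s)))

# True when the two patterns agree on every sample.
same_results = all(
    bool(escaped.search(s)) == bool(unescaped.search(s)) for s in samples
)
```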
Second, the regex /{4,} only matches four or more slashes with nothing else between them. So it matches ////, but not ///hello/, etc. Everything in the string, or at least everything in the part of the string matched by the regex, has to be accounted for in the regex. Perhaps you could try something like this:
(?:[^/]*/[^/]*){4,}
which will match four or more repetitions of "a slash with some optional non-slashes before and/or after it". Note that this is a partial regex, designed for something like REGEXP_CONTAINS, which only needs to find the pattern somewhere in the string; it isn't written to cover the whole string. If you want to turn it into something you can pass to REGEXP_MATCH, which has to match the entire string, just put a .* on the front and back of it:
.*(?:[^/]*/[^/]*){4,}.*
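To sanity-check both patterns against the URLs from the question, here is a minimal sketch in Python. Python's re module stands in for Data Studio's RE2 engine (an assumption; these patterns use only syntax that both engines accept), with search playing the role of REGEXP_CONTAINS and fullmatch the role of REGEXP_MATCH. The helper names are hypothetical.

```python
import re

# Python's re module stands in for Data Studio's RE2 engine here (an assumption
# made for illustration); both accept the syntax used in these patterns.
PARTIAL = re.compile(r"(?:[^/]*/[^/]*){4,}")    # partial matching, like REGEXP_CONTAINS
FULL = re.compile(r".*(?:[^/]*/[^/]*){4,}.*")   # whole-string matching, like REGEXP_MATCH

# URLs taken from the question: the first list should be kept, the second dropped.
keep = [
    "/abc/state/region/place1",
    "/abc/state/region/place1/details",
    "/abc/territory/region/place2",
    "/abc/state/region/place3/details/more-specific",
]
drop = ["/abc/state/region", "/abc", "/abc/xyz-page"]


def has_four_slashes(url: str) -> bool:
    """Return True if the partial pattern is found anywhere in url."""
    return PARTIAL.search(url) is not None


def whole_url_matches(url: str) -> bool:
    """Return True if the padded pattern matches the entire url."""
    return FULL.fullmatch(url) is not None


for url in keep + drop:
    # Counting slashes directly serves as an independent cross-check.
    print(url, has_four_slashes(url), whole_url_matches(url), url.count("/") >= 4)
```

Outside Data Studio, simply counting the slashes (as in the last column printed) may be easier to read than either regex.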
See https://github.com/google/re2/wiki/Syntax for what GDS supports in its regexes.