Regex for getting region and lang from url /xx/xx in two groups

Time:05-11

I have a URL structure where the first subdirectory is the region and the second, optional one is the language override:

https://example.com/no/en

I'm trying to get the two parts out in a group each. This way, in the JS, I can do the following to get each part of the url:

const pathname = window.location.pathname // '/no/en/mypage'
const match = pathname.match('xxx')
const region = match[1]   // 'no' or '/no'
const language = match[2] // 'en' or '/en'

I have tried several regexes, but none of them meets all of the requirements below. This is the closest I have come:

(\/[a-z]{2})(\/[a-z]{2})?

It is prone to error because it also matches "/do" from /donotmatch, as well as cases like /noada. I then tried to match two a-z characters followed by either a forward slash or no further characters, like this:

(\/[a-z]{2}\/|[^.])([a-z]{2}\/|[^.])?

I think I am not getting the syntax right for the "not" part.
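To illustrate why the first attempt over-matches, here is a small sketch that runs it against two of the problem paths. The variable names are only for illustration.

```javascript
// The first attempt from above, written as a regex literal.
const firstAttempt = /(\/[a-z]{2})(\/[a-z]{2})?/;

// Nothing in the pattern requires a segment to end after two letters,
// so the leading '/no' of '/nona' and '/do' of '/donotmatch' still match.
const overMatches = ['/nona', '/donotmatch'].map((path) => {
  const match = path.match(firstAttempt);
  return match ? match[1] : null;
});

console.log(overMatches); // [ '/no', '/do' ]
```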

The regex I am trying to create has to meet all of these criteria:

  • /no - group 1 match(no), group 2 undefined
  • /no/ - group 1 match(no), group 2 undefined
  • /nona - no matches
  • /no/en - group 1 match(no), group 2 match(en)
  • /no/en/ - group 1 match(no), group 2 match(en)
  • /no/enen - group 1 match(no), group 2 undefined
  • /no/en/something - group 1 match(no), group 2 match(en)
  • /no/en/jp - group 1 match(no), group 2 match(en) (jp is not going to be matched)

I feel I am really close to a working solution, but all my attempts so far have been slightly off.

If capturing the parts in groups is not possible, I suppose matching /xx/xx and then splitting on / is also an option.
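For reference, the split-based fallback could look roughly like the sketch below. The helper name parseBySplitting and the returned object shape are hypothetical, and it assumes lowercase two-letter codes only.

```javascript
// Hypothetical helper: split the path and validate each segment on its own.
function parseBySplitting(pathname) {
  // '/no/en/mypage' -> ['', 'no', 'en', 'mypage']
  const [, first, second] = pathname.split('/');

  // A segment counts as a code only if it is exactly two lowercase letters.
  const isCode = (segment) =>
    typeof segment === 'string' && /^[a-z]{2}$/.test(segment);

  if (!isCode(first)) return null; // e.g. '/nona'
  return { region: first, language: isCode(second) ? second : undefined };
}

console.log(parseBySplitting('/no/en/mypage')); // { region: 'no', language: 'en' }
```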

CodePudding user response:

You may use this regex with an optional 2nd capture group:

\/(\w{2})(?:\/(\w{2}))?(?:\/|$)


RegEx Explanation:

  • \/: Match starting /
  • (\w{2}): First capture group to match 2 word characters
  • (?:\/(\w{2}))?: Optional non-capture group that starts with a / followed by the second capture group, which matches 2 word characters.
  • (?:\/|$): Match a closing / or the end of the input
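Here is a minimal sketch of how this regex might be used on a pathname, checked against the cases from the question. The helper name getRegionAndLanguage is hypothetical, and the leading ^ is an addition that is not part of the regex above: without it, a path such as /nona/en would match at the later /en.

```javascript
// The regex from this answer, plus a ^ anchor (an assumption, see the note above).
const REGION_LANG = /^\/(\w{2})(?:\/(\w{2}))?(?:\/|$)/;

// Hypothetical helper; in the browser you would pass window.location.pathname.
function getRegionAndLanguage(pathname) {
  const match = pathname.match(REGION_LANG);
  if (!match) return null; // e.g. '/nona'
  const [, region, language] = match; // language is undefined when absent
  return { region, language };
}

const cases = [
  '/no', '/no/', '/nona', '/no/en',
  '/no/en/', '/no/enen', '/no/en/something', '/no/en/jp',
];
for (const path of cases) {
  console.log(path, getRegionAndLanguage(path));
}
```

The groups hold the bare codes ('no', 'en') rather than '/no' and '/en', because the slashes sit outside the capturing parentheses.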

CodePudding user response:

Follow each capture with (?=$|/), which is a look ahead to assert that what comes next is either end of input or a slash.

https?://[^/]+/(\w\w)(?=$|/)(?:/(\w\w)(?=$|/))?


The second capture is wrapped in an optional non-capture group via (?:…)?

To be stricter and allow only letters, replace \w with [a-z], but \w may be enough for your needs.
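As a rough sketch, the lookahead approach could be applied to a full URL in JavaScript like this, using the stricter [a-z] variant. The helper name parseUrl is hypothetical, and the slashes are escaped only because the pattern is written as a regex literal.

```javascript
// Lookahead variant for a full URL, with [a-z] in place of \w.
const URL_REGION_LANG =
  /https?:\/\/[^\/]+\/([a-z]{2})(?=$|\/)(?:\/([a-z]{2})(?=$|\/))?/;

// Hypothetical helper; in the browser you would pass window.location.href.
function parseUrl(url) {
  const match = url.match(URL_REGION_LANG);
  // match[2] is undefined when the optional language part is absent.
  return match ? { region: match[1], language: match[2] } : null;
}

console.log(parseUrl('https://example.com/no/en'));   // { region: 'no', language: 'en' }
console.log(parseUrl('https://example.com/no/enen')); // { region: 'no', language: undefined }
console.log(parseUrl('https://example.com/nona'));    // null
```

Because the lookahead accepts only a slash or the end of the input, a URL such as https://example.com/no?x=1 does not match with this pattern.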
