Regex group doesn't capture all of matched part of string [duplicate]

I have the following regex: '(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$'.

Given the string "/foo/bar/baz", I expect the first captured group to be "/foo/bar". However, I get the following:

>>> import re
>>> regex = re.compile(r'(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$')
>>> match = regex.match('/foo/bar/baz')
>>> match.group(1)
'/bar'

Why isn't the whole expected group being captured?

Edit: It's worth mentioning that the strings I'm trying to match are parts of URLs. To give you an idea, it's the part of the URL that window.location.pathname returns in JavaScript, only without file extensions.

CodePudding user response：

This group is quantified, so it can match several times in a row:

(/[a-zA-Z]+)*

However, as already discussed in another thread, to quote @ByteCommander:

If your capture group gets repeated by the pattern (you used the quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.

That is why you only see the last match, "/bar". What you can do instead is take advantage of the greedy matching of .* up to the last / via the pattern (/.*)/

regex = re.compile(r'(/.*)/([a-zA-Z]+)\.?$')
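
To make the quoted behaviour concrete, here is a minimal sketch comparing three patterns on the string from the question. The variable names are my own, and the "wrapped" pattern (an outer capturing group around a non-capturing repeated group) is an alternative I am adding for illustration; it is not taken from the thread.

```python
import re

path = "/foo/bar/baz"

# Pattern from the question: the capture group itself is quantified,
# so only its last iteration is kept.
repeated = re.compile(r"(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$")
m_repeated = repeated.match(path)
print(m_repeated.group(1))  # /bar

# Alternative (my assumption, not from the thread): repeat a
# non-capturing group and capture the whole repetition once.
wrapped = re.compile(r"((?:/[a-zA-Z]+)*)/([a-zA-Z]+)\.?$")
m_wrapped = wrapped.match(path)
print(m_wrapped.group(1))  # /foo/bar
print(m_wrapped.group(2))  # baz

# Greedy pattern from this answer: .* runs to the last "/".
greedy = re.compile(r"(/.*)/([a-zA-Z]+)\.?$")
m_greedy = greedy.match(path)
print(m_greedy.group(1))  # /foo/bar
```

One difference to be aware of: for a single-segment path such as "/baz", the wrapped pattern still matches (with an empty first group), while the greedy pattern does not match at all because it needs at least two slashes.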

CodePudding user response：

In this case, you may not need a regex at all. You can simply use the split function.

text = "/foo/bar/baz"
"/".join(text.split("/", 3)[:3])

Output:

/foo/bar

text.split("/", 3) splits the string at most three times, at the first three occurrences of /, and then you can join the desired elements.

As suggested by Niel, you can use a negative index to extract everything but the last part of a URL (or a path).

In this case, the generic approach would be:

text = "/foo/bar/baz/boo/bye"
"/".join(text.split("/", -1)[:-1])

Output:

/foo/bar/baz/boo
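
As a possible variation on the same idea, splitting from the right avoids cutting the string into every segment and joining it back. This is only a sketch using the standard str.rsplit and str.rpartition methods; the variable names are illustrative.

```python
text = "/foo/bar/baz/boo/bye"

# rsplit with maxsplit=1 splits only at the last "/".
parent, last = text.rsplit("/", 1)
print(parent)  # /foo/bar/baz/boo
print(last)    # bye

# rpartition returns (head, separator, tail) around the last "/".
head, _, tail = text.rpartition("/")
print(head)    # /foo/bar/baz/boo
```

Both give the same result as the split-and-join version above for this input.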

CodePudding user response：

You don't need the * between the two expressions here; also, move the first / into the brackets:

>>> regex = re.compile(r'([/a-zA-Z]+)/([a-zA-Z]+)\.?$')
>>> regex.match('/foo/bar/baz').group(1)
'/foo/bar'
>>>
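
Note that putting / inside the character class makes the pattern more permissive than the one in the question, since slashes and letters can then appear in any order. The sketch below compares the two on a few inputs; the test strings and variable names are my own, chosen only for illustration.

```python
import re

# Pattern from the question.
original = re.compile(r"(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$")
# Character-class pattern from this answer.
char_class = re.compile(r"([/a-zA-Z]+)/([a-zA-Z]+)\.?$")

results = {}
for candidate in ["/foo/bar/baz", "//foo//bar", "foo/bar"]:
    results[candidate] = (
        original.match(candidate) is not None,
        char_class.match(candidate) is not None,
    )
    print(candidate, results[candidate])

# /foo/bar/baz (True, True)
# //foo//bar (False, True)
# foo/bar (False, True)
```

If doubled slashes or a missing leading slash should be rejected, the stricter pattern may be the safer choice.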