How to get a regex group between two substrings when the second substring is optional

So for example I have 6 strings as follows:

https://twitter.com/test1
http://twitter.com/test2
https://www.twitter.com/test3?
https://www.mobile.twitter.com/test4
https://www.twitter.com/test5?lang=en
https://www.instagram.com/test1insta

What I want to do is extract the Twitter 'username' from these links. In this case I would like to search each link with a regex to get the username after twitter.com/, and where a link has a ? for URL parameters, I would like to get everything before it.

For example it would come out like this:

test1 test2 test3 test4 test5

I have used search to get the pattern but I am struggling with how to get it to just extract the part I want. Here is what I have tried:

username = re.search(r'twitter.com\/(.*)\?', stringsList)

This only matches the strings that have a question mark after the username, which I understand, so just test3 and test5.

I thought I would try making the question mark optional by doing this:

username = re.search(r'twitter.com\/(.*)\??', stringsList)

but instead that just returns all of the usernames along with all the additional stuff I don't want, e.g.:

test1 test2 test3? test4 test5?lang=en

But I want it to still extract just the username as group 1 even though the ? should be optional.

What would my regular expression look like to do that? Or do I need to split this up, check whether the string has a question mark first, and use two different searches depending on whether it is present?

I have a test bit of code here, and I've been trying to use this to determine the regex I would like.
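
For reference, the two attempts described above can be reproduced with a minimal sketch. It assumes the patterns are applied to one URL at a time, since re.search expects a single string rather than a list. The names urls, first_group, required and optional are illustrative only.

```python
import re

urls = [
    "https://twitter.com/test1",
    "http://twitter.com/test2",
    "https://www.twitter.com/test3?",
    "https://www.mobile.twitter.com/test4",
    "https://www.twitter.com/test5?lang=en",
    "https://www.instagram.com/test1insta",
]

def first_group(pattern, url):
    """Return group 1 of the first match, or None when nothing matches."""
    match = re.search(pattern, url)
    return match.group(1) if match else None

# Attempt 1: the literal "?" is required, so URLs without one do not match.
required = [first_group(r'twitter.com\/(.*)\?', u) for u in urls]

# Attempt 2: the greedy (.*) runs to the end of the string, and the optional
# "?" then matches nothing, so the query string ends up inside group 1.
optional = [first_group(r'twitter.com\/(.*)\??', u) for u in urls]

print(required)
print(optional)
```

The greedy (.*) is what pulls the query string into the group in the second attempt.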

CodePudding user response：

To be domain agnostic:

(?:https?:\/\/)?(?:[^?\/\s]+[?\/])([a-zA-Z0-9]*)

The username should be in group 1. This is a modified version of this answer, which has a couple of other good methods.

I changed the last filter so that it doesn't include special characters. If underscores are valid, you can add one to the character class in the last capture group:

(?:https?:\/\/)?(?:[^?\/\s]+[?\/])([a-zA-Z0-9_]*)

Or something like this to get everything up to the ? (this one only matches when a ? is present):

(?:https?:\/\/)?(?:[^?\/\s]+[?\/])(.*?)\?
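
As a rough sketch of how the underscore-friendly pattern behaves on the sample links from the question, assuming the quantifier after the host character class is + and that each URL is searched on its own. The names SEGMENT and first_segment are hypothetical helpers for illustration.

```python
import re

# Domain-agnostic pattern: optional scheme, then the host up to the first
# "/" or "?", then the captured run of letters, digits and underscores.
SEGMENT = re.compile(r'(?:https?:\/\/)?(?:[^?\/\s]+[?\/])([a-zA-Z0-9_]*)')

def first_segment(url):
    """Return the first path segment after the host, or None if no match."""
    match = SEGMENT.search(url)
    return match.group(1) if match else None

urls = [
    "https://twitter.com/test1",
    "http://twitter.com/test2",
    "https://www.twitter.com/test3?",
    "https://www.mobile.twitter.com/test4",
    "https://www.twitter.com/test5?lang=en",
    "https://www.instagram.com/test1insta",
]

for url in urls:
    print(url, "->", first_segment(url))
```

Because the pattern ignores the domain, the Instagram link also yields a value (test1insta), so a separate check for twitter.com is still needed if only Twitter usernames are wanted.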

CodePudding user response：

You can use a lookbehind to avoid matching the first part. Then limit the match on the right-hand side to any characters other than "?" and whitespace.

(?<=twitter\.com\/)[^?\s]+

Your Python code can be simplified by dropping the capture group (username.group(1) becomes username.group(), since the whole match is now the username) as follows:

import re

twittercount = 0
NOTtwittercount = 0
for twitterURL in twitterURLs:
    if twitterURL.twitter_url and 'twitter.com' in twitterURL.twitter_url:
        twittercount += 1
        # The lookbehind skips "twitter.com/", so the match itself is the username.
        username = re.search(r'(?<=twitter\.com\/)[^?\s]+', twitterURL.twitter_url)
        print("correct twitter link =", twitterURL.twitter_url)
        # re.search returns a match object (or None), so read the text with group().
        print("extracted username =", username.group() if username else None)
    else:
        NOTtwittercount += 1
        print("incorrect twitter link =", twitterURL.twitter_url)
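
A self-contained sketch of the same idea on the sample links from the question, shown next to a capture-group variant for anyone who prefers to keep group 1. It assumes nothing but the username follows twitter.com/ before any ?, because the character class does not stop at a further "/". The helper names via_lookbehind and via_group are made up for this example.

```python
import re

LOOKBEHIND = re.compile(r'(?<=twitter\.com/)[^?\s]+')
CAPTURE = re.compile(r'twitter\.com/([^?\s]+)')

def via_lookbehind(url):
    """The whole match is the username; None when the URL is not Twitter."""
    match = LOOKBEHIND.search(url)
    return match.group() if match else None

def via_group(url):
    """Same result, but the username is read from capture group 1."""
    match = CAPTURE.search(url)
    return match.group(1) if match else None

urls = [
    "https://twitter.com/test1",
    "http://twitter.com/test2",
    "https://www.twitter.com/test3?",
    "https://www.mobile.twitter.com/test4",
    "https://www.twitter.com/test5?lang=en",
    "https://www.instagram.com/test1insta",
]

for url in urls:
    print(url, "->", via_lookbehind(url), via_group(url))
```

Both variants rely on the negated character class [^?\s]+ stopping before the question mark, so no optional "?" is needed in the pattern at all.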
