So for example I have 6 strings as follows:
https://twitter.com/test1
http://twitter.com/test2
https://www.twitter.com/test3?
https://www.mobile.twitter.com/test4
https://www.twitter.com/test5?lang=en
https://www.instagram.com/test1insta
What I want to do is extract the Twitter 'username' from these links. So in this case I would like to search each link with regex to get the username after twitter.com/ and, in the cases where the links have a ? for URL parameters, I would like to get everything before it.
For example it would come out like this:
test1
test2
test3
test4
test5
I have used search to get the pattern but I am struggling with how to get it to just extract the part I want. Here is what I have tried:
username = re.search(r'twitter.com\/(.*)\?', stringsList)
This results in only matching those strings that have a question mark after them, which I understand, so just test3 and test5.
I thought I would try making the question mark optional by doing this:
username = re.search(r'twitter.com\/(.*)\??', stringsList)
but instead that just returns all of the usernames along with all the additional stuff I don't want, e.g.:
test1
test2
test3?
test4
test5?lang=en
But I want it to still extract just the username as group 1 even though the ? should be optional.
What would my regex expression look like for me to do that or do I need to split this up and check if the string has a question mark first and use two different searches based on if its present or not?
I have a test bit of code here and I've been trying to use this to determine the regex I would like.
CodePudding user response:
To be domain agnostic:
(?:https?:\/\/)?(?:[^?\/\s]+[?\/])([a-zA-Z0-9]*)
The username should be in group 1. This is a modified version of this answer, which has a couple of other good methods.
I changed the last filter so that it doesn't include special characters. If underscores are valid, then you can just add an underscore to the last capture group:
(?:https?:\/\/)?(?:[^?\/\s]+[?\/])([a-zA-Z0-9_]*)
or something like this to get everything up to the ?:
(?:https?:\/\/)?(?:[^?\/\s]+[?\/])(.*?)\?
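As a rough check of the patterns above, here is a minimal sketch that runs the underscore variant and the "up to the ?" variant against the six sample links from the question. The helper name first_group and the variable names are made up for illustration, and the patterns are written with the + quantifiers assumed above.

```python
import re

# Sample links from the question.
urls = [
    "https://twitter.com/test1",
    "http://twitter.com/test2",
    "https://www.twitter.com/test3?",
    "https://www.mobile.twitter.com/test4",
    "https://www.twitter.com/test5?lang=en",
    "https://www.instagram.com/test1insta",
]

# Domain-agnostic pattern, with underscores allowed in the username.
any_domain = re.compile(r'(?:https?://)?(?:[^?/\s]+[?/])([a-zA-Z0-9_]*)')
# Variant that captures everything up to a literal "?".
up_to_question_mark = re.compile(r'(?:https?://)?(?:[^?/\s]+[?/])(.*?)\?')


def first_group(pattern, url):
    """Hypothetical helper: return group 1 of the first match, or None."""
    match = pattern.search(url)
    return match.group(1) if match else None


agnostic = [first_group(any_domain, u) for u in urls]
strict = [first_group(up_to_question_mark, u) for u in urls]

print(agnostic)
print(strict)
```

Because the first pattern is domain agnostic, it also returns a name for the Instagram link, so a check such as 'twitter.com' in url is still needed if only Twitter links should count. The last variant requires a literal ?, so it returns None for links that have no query string.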
CodePudding user response:
You can use a lookbehind to avoid matching the first part. Then limit the match on the right-hand side to any characters other than "?" and whitespace.
(?<=twitter.com\/)[^?\s]+
Your Python code can be simplified by removing the capture group (username.group(1) becomes username.group(), since the whole match is now the username) as follows:
import re

twittercount = 0
NOTtwittercount = 0
for twitterURL in twitterURLs:
    if twitterURL.twitter_url and 'twitter.com' in twitterURL.twitter_url:
        twittercount += 1
        # The lookbehind skips "twitter.com/"; the match stops at "?" or whitespace.
        username = re.search(r'(?<=twitter.com\/)[^?\s]+', twitterURL.twitter_url)
        print("correct twitter link =", twitterURL.twitter_url)
        # re.search returns a match object (or None), so read the text with .group().
        print("extracted username =", username.group() if username else None)
    else:
        NOTtwittercount += 1
        print("incorrect twitter link =", twitterURL.twitter_url)
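For a version that runs without the twitterURLs objects, here is a minimal self-contained sketch of the same lookbehind applied to the six sample strings from the question. The function name extract_username is hypothetical, and the dot in twitter\.com is escaped here so that it only matches a literal dot.

```python
import re

# Sample links from the question.
urls = [
    "https://twitter.com/test1",
    "http://twitter.com/test2",
    "https://www.twitter.com/test3?",
    "https://www.mobile.twitter.com/test4",
    "https://www.twitter.com/test5?lang=en",
    "https://www.instagram.com/test1insta",
]


def extract_username(url):
    """Hypothetical helper: text after 'twitter.com/' up to '?' or whitespace."""
    match = re.search(r'(?<=twitter\.com/)[^?\s]+', url)
    # re.search returns None when the link is not a twitter.com link.
    return match.group() if match else None


usernames = [extract_username(u) for u in urls]
for url, name in zip(urls, usernames):
    print(url, "->", name)
```

The Instagram link yields None, so the separate 'twitter.com' in ... check is not strictly needed when the result is tested for None.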