Automate the removal of part of a URL

I have a list of Instagram URLs. I want to extract just the username from this list.

For example, the existing value is: ['https://www.instagram.com/thestackoverflow/?hl=en']

From that, I'd like to get thestackoverflow as the output.

The first part of the problem is removing https://www.instagram.com/, which should be simple enough, although I haven't been able to work out how after hours of looking.

The more complex task would be removing the /?hl=en (if the link has one), as the parameter can take different values.

However, once the first part is figured out, I think this wouldn't be too much of an issue.

From research, I found that Instagram supports 25 languages. Hopefully these use the same host language parameters as Google, which are listed here.

I should be able to make a loop to check if there is a language modifier at the end and remove it.

I tried using urllib.parse but this didn't work. It doesn't split up the URL in any way. Here is an example of the result:

ParseResult(scheme='', netloc='', path="['https://www.instagram.com/thestackoverflow/']", params='', query='', fragment='')
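
For what it's worth, the path field in that result holds the entire list literal, brackets and quotes included, which suggests the list (or its string form) was handed to urlparse instead of one URL string at a time. A minimal sketch of parsing each element separately, assuming the username is always the first path segment (the helper name extract_username is made up for illustration):

```python
from urllib.parse import urlparse


def extract_username(url):
    """Return the first path segment of a profile URL, or None if there is none."""
    # urlparse separates the query (?hl=en) from the path, so the value of
    # hl never has to be inspected or looped over.
    path = urlparse(url).path            # e.g. '/thestackoverflow/'
    segments = [part for part in path.split("/") if part]
    return segments[0] if segments else None


urls = ['https://www.instagram.com/thestackoverflow/?hl=en']
# Parse each string in the list, not the list itself.
print([extract_username(u) for u in urls])  # ['thestackoverflow']
```

Because the query string ends up in its own field, this should behave the same whichever language code follows hl=, and also when there is no query string at all.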

If anyone could help I'd much appreciate it!

In case it's of any relevance, the list of URLs is coming from an xlsx sheet I'm interfacing with using openpyxl. The output is going to be in the adjacent cell.
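
Since the URLs live in a worksheet, the write-back step might look roughly like the sketch below. It assumes each cell in column A holds a full URL (including https://) and that the username belongs in column B. The file name accounts.xlsx and both helper names are placeholders, not anything taken from the actual workbook. The load and save calls are left as comments so the block runs without a file on disk.

```python
from urllib.parse import urlparse


def username_from_value(value):
    """Return the first path segment of a URL cell value, or None for blanks."""
    if not isinstance(value, str) or not value.strip():
        return None  # empty cell or non-text value
    parts = [p for p in urlparse(value.strip()).path.split("/") if p]
    return parts[0] if parts else None


def fill_usernames(ws, url_column=1):
    """Write each row's username into the cell to the right of its URL."""
    # iter_rows yields one tuple per row; with a single column it has one cell.
    for (cell,) in ws.iter_rows(min_col=url_column, max_col=url_column):
        cell.offset(column=1).value = username_from_value(cell.value)


# Hypothetical usage with openpyxl (the file name is an assumption):
#   from openpyxl import load_workbook
#   wb = load_workbook("accounts.xlsx")
#   fill_usernames(wb.active)
#   wb.save("accounts.xlsx")
```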

CodePudding user response：

If every URL has the same structure, you can use, for example, the question mark as a fixed point and do a few splits from there.

url = 'https://www.instagram.com/thestackoverflow/?hl=en'
# Keep everything before "/?", then take the last "/"-separated piece.
interesting_part = url.split("/?")[0].split("/")[-1]
print(interesting_part)  # thestackoverflow
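
If some links lack the /?... suffix or the trailing slash, the same splitting idea can be made a little more tolerant. A sketch, assuming the username is always the last path segment (last_segment is a made-up name):

```python
def last_segment(url):
    """Drop any query string, then return the last non-empty path piece."""
    without_query = url.split("?", 1)[0]   # 'https://www.instagram.com/thestackoverflow/'
    return without_query.rstrip("/").split("/")[-1]


for url in (
    'https://www.instagram.com/thestackoverflow/?hl=en',
    'https://www.instagram.com/thestackoverflow/',
    'https://www.instagram.com/thestackoverflow',
):
    print(last_segment(url))  # thestackoverflow each time
```

A bare https://www.instagram.com/ would come back as the host name, so this still relies on every URL pointing at a profile.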

CodePudding user response：

If all the URLs have '.com/' before the string you want to extract, you can use re.search:

import re

url = 'https://www.instagram.com/thestackoverflow/?hl=en'
# Escape the dot and stop at the next "/", "?" or "#" so the match cannot overrun.
result = re.search(r'\.com/([^/?#]+)', url)
print(result.group(1))

Output:

thestackoverflow