Is there a way to scrape the page URL (or a part of it) located in the address bar using Selenium?

I'm working on a huge dataset of movies and I'm trying to get the IMDb ID of each movie from the IMDb website. I'm using Selenium in Python. I checked, but the IMDb ID does not appear anywhere inside the movie page itself. It is contained in the URL of the page, which is shown in the address bar, and I don't know how to scrape it. Are there any methods of doing this?

This is an example of the page:

I need to get the underlined part of the URL.

Does anyone know how to do it?

CodePudding user response：

If you want to get the title ID from the movie URL, first read driver.current_url, then use Python's split() function to take the second-to-last segment.

# Assumes the URL has the form https://www.imdb.com/title/tt1877830/?ref_=...
imdb_id = driver.current_url.split("/")[-2]
print(imdb_id)

This will print tt1877830
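The split-based approach above depends on where the slashes fall in the URL. As a minimal sketch of a more defensive variant, the snippet below parses the URL with the standard library and matches the ID with a regular expression. The helper name `extract_imdb_id` is hypothetical, and the pattern assumes title IDs look like `tt` followed by digits, as in the example tt1877830.

```python
import re
from urllib.parse import urlparse

# Assumption: IMDb title IDs have the form "tt" + digits, as in tt1877830.
_TITLE_ID = re.compile(r"/title/(tt\d+)")


def extract_imdb_id(url):
    """Return the title ID found in the URL path, or None if there is none."""
    path = urlparse(url).path  # drops the query string and fragment
    match = _TITLE_ID.search(path)
    return match.group(1) if match else None


print(extract_imdb_id("https://www.imdb.com/title/tt1877830/?ref_=fn_al_tt_1"))
```

With Selenium you would pass `driver.current_url` as the argument. Returning None instead of raising makes it easier to skip non-title pages when looping over a large dataset.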

CodePudding user response：

Try driver.current_url

Reference: https://selenium-python.readthedocs.io/api.html

It's also worth noting that IMDb has an official API, which you could look at as well: https://aws.amazon.com/marketplace/pp/prodview-bj74roaptgdpi?sr=0-1&ref_=beagle&applicationId=AWSMPContessa

CodePudding user response：

To extract the page URL (or a part of it, i.e. the underlined part, e.g. tt1877830), you can read current_url, split it on the / character, and use either of the following solutions:

Using Positive Index:

driver.get('https://www.imdb.com/title/tt1877830/?ref_=fn_al_tt_1')
WebDriverWait(driver, 20).until(EC.url_contains("title"))
print(driver.current_url.split("/")[4])

Console Output:
```
tt1877830
```

Using Negative Index:

driver.get('https://www.imdb.com/title/tt1877830/?ref_=fn_al_tt_1')
WebDriverWait(driver, 20).until(EC.url_contains("title"))
print(driver.current_url.split("/")[-2])

Console Output:
```
tt1877830
```

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
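The two indexing styles agree on the example URL, but they can differ when the URL has no trailing slash after the ID. The following sketch runs both on plain strings, so no browser is needed. The helper names and the extra URL variants are illustrative assumptions, not taken from the question.

```python
# Hypothetical URL variants for the same title page.
urls = [
    "https://www.imdb.com/title/tt1877830/?ref_=fn_al_tt_1",
    "https://www.imdb.com/title/tt1877830/",
    "https://www.imdb.com/title/tt1877830",  # no trailing slash
]


def by_positive_index(url):
    # Segments: 'https:', '', 'www.imdb.com', 'title', '<id>', ...
    return url.split("/")[4]


def by_negative_index(url):
    # Second-to-last segment; relies on a slash following the ID.
    return url.split("/")[-2]


for url in urls:
    print(by_positive_index(url), by_negative_index(url))
```

For the last URL the negative index returns title rather than the ID, so the positive index is the safer choice if the redirect target may lack the trailing slash.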