Home > Back-end >  Is there a way to scrape the page url (or a part of it) located in the address bar using selenium in
Is there a way to scrape the page url (or a part of it) located in the address bar using selenium in

Time:03-25

I'm working on a huge dataset of movies and I'm trying to get the IMDb ID of each movie from the IMDB website. I'm using selenium in Python. I checked, but inside the movie page you can't find the IMDB code. It is contained into the link of the page, which is in the address bar and I don't know how to scrape it. Are there any methods of doing this?

This is an example of the page:

enter image description here

I need to get the underlined part of the url.

Does anyone know how to do it?

CodePudding user response:

If you want to fetch the title of movie url you need to first fetch the current_url and then using python split() function you can get the second last string.

currenturl=driver.current_url.split("/")[-2]
print(currenturl)

This will returned tt1877830

CodePudding user response:

Try driver.current_url

Reference: https://selenium-python.readthedocs.io/api.html

Also, worth noting that IMDB has an official API. You could look at that as well https://aws.amazon.com/marketplace/pp/prodview-bj74roaptgdpi?sr=0-1&ref_=beagle&applicationId=AWSMPContessa

CodePudding user response:

To extract the page url 9or a part of it i.e. the underlined part) e.g. tt1877830, you can extract the current_url and split it with respect to the / character and you can use either of the following solutions:

  • Using Positive Index:

    driver.get('https://www.imdb.com/title/tt1877830/?ref_=fn_al_tt_1')
    WebDriverWait(driver, 20).until(EC.url_contains("title"))
    print(driver.current_url.split("/")[4])
    
  • Console Output:

    tt1877830
    
  • Using Negative Index:

    driver.get('https://www.imdb.com/title/tt1877830/?ref_=fn_al_tt_1')
    WebDriverWait(driver, 20).until(EC.url_contains("title"))
    print(driver.current_url.split("/")[-2])
    
  • Console Output:

    tt1877830
    
  • Note : You have to add the following imports :

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
  • Related