Scraping a page that updates using Python Selenium


For weeks now I have been trying to scrape all the information from this website. The page is a company profile, and I'm trying to get everything in the id="profile-basic" and id="profile-addresses" sections.

I am looping through 1000 of these profiles, and this is only one of them. I'm not showing that code because it's very basic and doesn't affect my question; for those who want to know, it's just a simple for loop that goes through a list one by one.

The problem is that many of the elements appear on some profiles but not on others. I tried solving that by writing down the XPaths of all possible elements and then using try: to check each of them, and that worked just fine. The remaining problem is that a given XPath does not always point to the same piece of information: the XPath for the address might be //*[@id="profile-addresses"]/div/div/div/div[1]/p, but sometimes it is //*[@id="profile-addresses"]/div/div/div/div[2]/p, or one of many other XPaths. Since I'm trying to put the address into the address variable, it is impossible to tell in advance which XPath will hold the address on a given page.
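
One way to pin down a row regardless of its position is to anchor the XPath on the label text instead of the row index. A minimal sketch, assuming (as the XPaths above suggest) that each row's label <span> sits in the same div as, and before, its value <p>:

    # Minimal sketch: find the <p> whose sibling <span> label contains "آدرس"
    # (assumes the label <span> comes before the value <p> in each row).
    address = browser.find_element(
        By.XPATH,
        '//*[@id="profile-addresses"]//span[contains(text(), "آدرس")]'
        '/following-sibling::p'
    ).text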

I tried using this code:

    names = {"آدرس تولیدی :" : "Address", "آدرس دفتر :" : "Address", "تلفن :" : "Phone2",
              "تعداد پرسنل :" : "StaffNumber", "کدپستی :" : "PostalCode", "توضیحات :" : 
              "Description2"}
    try:
        # Read the label <span>, look up the variable name it maps to, read the
        # matching <p> value, then assign it to that variable via exec().
        e=browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[1]/span').text
        _1 = names.get(e)
        __1 = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[1]/p').text
        exec(f"global {_1}\n{_1} = Smalify('{__1}')")
    except:
        pass
    try:
        e=browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[2]/span').text
        _2 = names.get(e)
        __2 = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[2]/p').text
        exec(f"global {_2}\n{_2} = Smalify('{__2}')")
    except:
        pass
    try:
        e=browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[3]/span').text
        _3 = names.get(e)
        __3 = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[3]/p').text
        exec(f"global {_3}\n{_3} = Smalify('{__3}')")
    except:
        pass
    try:
        e=browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[4]/span').text
        _4 = names.get(e)
        __4 = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[4]/p').text
        exec(f"global {_4}\n{_4} = Smalify('{__4}')")
    except:
        pass
    try:
        e=browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[5]/span').text
        _5 = names.get(e)
        __5 = browser.find_element(By.XPATH, '//*[@id="profile-addresses"]/div/div/div/div[5]/p').text
        exec(f"global {_5}\n{_5} = Smalify('{__5}')")

    except:
        pass

The code above reads the span in front of the main element, then looks up the matching variable name in the names dictionary, and when it finds one it sets that variable to the value of the main element using the exec() function.

This code did not work at all, for two reasons: A) it always returned None, even when it could find the elements, and B) it took way too long.
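
For illustration, one likely failure mode (an assumption; the question does not show the page source) is that the scraped label does not match a dictionary key exactly, so the lookup returns None and the exec() call fails silently:

    # Hypothetical example: a trailing space in the scraped label breaks the lookup.
    names = {"تلفن :": "Phone2"}
    label = "تلفن : "              # note the extra trailing space
    print(names.get(label))        # -> None
    # exec(f"global None\nNone = Smalify('...')") then raises a SyntaxError,
    # which the bare `except: pass` swallows, so nothing is ever assigned.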

I was wondering if there is any way other than my code to do this efficiently.

CodePudding user response:

You can always try to search by ID rather than by XPath. Since the XPath varies between pages, look for something that is static, such as an ID.

There is more information about the different ways you can locate specific HTML elements using Selenium in the Selenium documentation on locator strategies. I definitely recommend checking it out.

Here is an example of searching for your elements by their IDs:

browser.find_element(By.ID, "profile-addresses").text
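
Building on that, here is a rough sketch (assuming, as the XPaths in the question suggest, that each row under #profile-addresses is a div containing a label <span> and a value <p>) that anchors on the ID once and collects whatever rows a profile happens to have into a dictionary instead of exec():

    from selenium.webdriver.common.by import By

    # Same label-to-field mapping as in the question.
    names = {"آدرس تولیدی :": "Address", "آدرس دفتر :": "Address", "تلفن :": "Phone2",
             "تعداد پرسنل :": "StaffNumber", "کدپستی :": "PostalCode",
             "توضیحات :": "Description2"}

    def scrape_profile_addresses(browser):
        """Return a dict of whichever label/value rows exist on this profile."""
        data = {}
        container = browser.find_element(By.ID, "profile-addresses")
        # One relative XPath covers every row, instead of one absolute XPath per index.
        for row in container.find_elements(By.XPATH, "./div/div/div/div"):
            try:
                label = row.find_element(By.TAG_NAME, "span").text.strip()
                value = row.find_element(By.TAG_NAME, "p").text.strip()
            except Exception:
                continue  # row without a span/p pair: skip it
            field = names.get(label)
            if field:                 # unrecognised labels are simply ignored
                data[field] = value
        return data

This avoids the missing-element problem (rows that aren't there are never visited), and storing the results in a dictionary removes the need for exec() entirely.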

Good luck!
