Scraping works sometimes

I am trying to scrape https://www.trulia.com/sold/32303_zip/6_srl/. I only need the number of homes sold, which appears in the "... sold homes on Trulia" sentence at the top right of the page. The code below sometimes gets the number, but other times it gets "Nearby" from the "Nearby Real Estate" heading at the bottom, which is another h2 element. What is wrong with or missing from my code?

import requests
from bs4 import BeautifulSoup

url = "https://www.trulia.com/sold/32303_zip/6_srl/"
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')

# First <h2> inside the first <div> of the page
s = soup.div.h2.get_text()
# Keep the first word and drop the thousands separator
s = s.split()[0].replace(',', '')

CodePudding user response：

I would use Selenium. First, install it:

pip install selenium

Then download a ChromeDriver build that matches your installed Chrome version, put chromedriver.exe in your working directory, and run the following code:

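from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.trulia.com/sold/32303_zip/6_srl/")
try:
    # Primary layout: the count is the first word of the h2 heading
    homes_sold = driver.find_element(By.XPATH, "//*[@id='resultsColumn']/div[1]/div/div[3]/div/h2").text.split(' ')[0]
except NoSuchElementException:
    # Fallback layout: the count is in a span and starts with '('
    homes_sold = driver.find_element(By.XPATH, "//*[@id='resultsColumn']/div[1]/div/div[1]/span").text.split(' ')[0]
    homes_sold = homes_sold.replace('(', '')
print(homes_sold)
driver.quit()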

Using requests would work as well. I am just providing an alternate example.
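For the requests route, here is a rough sketch of one way to avoid picking up the wrong heading. soup.div.h2 returns the first h2 inside the first div of whatever HTML was served, so this version scans every h2 and keeps the one whose text mentions "sold homes". The function name sold_count_from_html is made up for illustration, and the heading wording is assumed from the question rather than checked against the live page.

```python
from typing import Optional

from bs4 import BeautifulSoup


def sold_count_from_html(html: str) -> Optional[int]:
    """Return the count from the h2 that mentions 'sold homes', or None.

    Assumes the heading starts with the number, e.g. '1,234 sold homes on Trulia'.
    """
    soup = BeautifulSoup(html, "html.parser")
    for heading in soup.find_all("h2"):
        text = heading.get_text(" ", strip=True)
        if "sold homes" in text.lower():
            # First word of the heading, without the thousands separator
            first = text.split()[0].replace(",", "")
            return int(first) if first.isdigit() else None
    return None
```

You would pass it requests.get(url).text. If it returns None, the served HTML probably does not contain that heading at all, which is worth checking before changing the parsing.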

CodePudding user response：

Resolved by downloading the page with wget and parsing the content as a string, as per @Tim Roberts' suggestion.
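The exact code was not posted, so the following is only a guess at what the string parsing could look like. It assumes the downloaded page contains the number directly before the phrase "sold homes on Trulia"; the function name extract_sold_count and the regular expression are illustrative, not taken from the thread.

```python
import re
from typing import Optional


def extract_sold_count(page_text: str) -> Optional[int]:
    """Find a number such as '1,234' directly before 'sold homes on Trulia'."""
    match = re.search(r"(\d[\d,]*)\s+sold homes on Trulia", page_text, re.IGNORECASE)
    if match is None:
        return None
    # Drop the thousands separators before converting
    return int(match.group(1).replace(",", ""))
```

The input can be the contents of the file that wget saved, read in as text. Because it searches the raw string, it does not depend on where the heading sits in the element tree.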