Home > Software engineering >  Getting the href of the a tag
Getting the href of the a tag

Time:11-17

I want to scrape data from the cinch.co.uk website. I am using Python with BeautifulSoup4 and Request libraries.

For each car ad, I want to get inside each link and then scrape car data. Here is the HTML and CSS of each ad. I can see that when I am not clicking on the h3 tag the text is ... , however, if I click on it is different.

The problem I have is that when I get on the h3 tag level (where the a tag lies), it seems that it cannot see it as after I run ad = car.find('div', {'class': 'jB_k1'}).find('h3') and then I print(ad) I get this. The only reference for the link of the ad is in that a tag so I cannot get the link from other tags. Do I have this problem because the website uses ::before?

This is what I have tried so far:

"""
Method to get the HTML of a page
website - URL of the page

return - HTML of the page

"""
def getData(website):
       response = session.get(website)
       soup = BeautifulSoup(response.text, 'html.parser')
       return soup

"""
Method to get to  the next page
soup - html of a page

return - url of the next page or none if it doesn't exist
"""
def getNextPage(soup):
    pages = soup.find('ul', {'class' :'cf_gY'})
    pages = soup.find_all('li', {'class' : 'cf_kD'})
       
    website = None
    for page in pages:
        if page.find('a', {'aria-label' : 'Next page'}):
            website = 'http://www.cinch.co.uk'   str(page.find('a')['href'])
    
    return website
        
"""
Method to click onto a car ad
car - HTML of the car ad

return - URL of the car ad or none if it doesn't exist
"""
def getIntoPage(car):
    ad = 'https://www.cinch.co.uk'   car.find('a', {'class' : 'jB_dD'})['href']
    return ad

while True:

soup = getData(website)
website = getNextPage(soup)
nr =1

#finds all the cars
cars = soup.find('ol', {'class': 'fJ_gY'})
cars = soup.find_all('article', {'class': 'lC_gQ lC_RB'})

for car in cars:
    
    ad = car.find('div', {'class': 'jB_k1'}).find('h3')
    getIntoPage(ad)
    break
break

The break statements I have are for testing just one ad as there are loads of them on the website.

CodePudding user response:

You get this problem because the website uses javascript which the request module can not render. The only solution I've found so far is to use selenium with webdriver and render the page with javascript. Unfortunately, the request module can not handle dynamic content as far as I know.

  • Related