bs4 findAll not collecting all of the data from the other pages on the website

Time:09-18

I'm trying to scrape a real estate website using BeautifulSoup. I'm trying to get a list of rental prices for London. This works but only for the first page on the website. There are over 150 of them so I'm missing out on a lot of data. I would like to be able to collect all the prices from all the pages. Here is the code I'm using:

import requests
from bs4 import BeautifulSoup as soup

url  = 'https://www.zoopla.co.uk/to-rent/property/central-london/?beds_max=5&price_frequency=per_month&q=Central London&results_sort=newest_listings&search_source=home'
response = requests.get(url)
response.status_code

data  = soup(response.content, 'lxml')

prices = []
for line in data.findAll('div', {'class': 'css-1e28vvi-PriceContainer e2uk8e7'}):
    price = str(line).split('>')[2].split(' ')[0].replace('£', '').replace(',','')
    price = int(price)
    prices.append(price)

Any idea as to why I can't collect the prices from all the pages using this script?

Extra question: is there a way to access the price using soup, i.e. without doing any list/string manipulation? When I call data.find('div', {'class': 'css-1e28vvi-PriceContainer e2uk8e7'}) I get a tag of the following form: <div class="css-1e28vvi-PriceContainer e2uk8e7" data-testid="listing-price"><p class="css-1o565rw-Text eczcs4p0" size="6">£3,012 pcm</p></div>

Any help would be much appreciated!

Answer:

Your script requests only one URL, so it only ever sees the first page of results. You can append the &pn=<page number> parameter to the URL to get the following pages:

import re
import requests
from bs4 import BeautifulSoup as soup

url = "https://www.zoopla.co.uk/to-rent/property/central-london/?beds_max=5&price_frequency=per_month&q=Central London&results_sort=newest_listings&search_source=home&pn="

prices = []
for page in range(1, 3):  # <-- increase number of pages here
    data = soup(requests.get(url + str(page)).content, "lxml")

    for line in data.findAll(
        "div", {"class": "css-1e28vvi-PriceContainer e2uk8e7"}
    ):
        price = line.get_text(strip=True)
        price = int(re.sub(r"[^\d]", "", price))
        prices.append(price)
        print(price)
    print("-" * 80)

print(len(prices))

Prints:


...

1993
1993
--------------------------------------------------------------------------------
50
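
On the extra question, and to avoid hard-coding the page count, here is a rough sketch rather than a definitive solution. It selects prices by the data-testid="listing-price" attribute shown in the question instead of the generated class names, and it keeps requesting pages until one yields no prices. The names extract_prices, scrape_all, fetch_html and max_pages are made up for illustration. It assumes that a page past the last one contains no price elements; if the site repeats the last page instead, only the max_pages cap stops the loop.

```python
import re

from bs4 import BeautifulSoup


def extract_prices(html):
    """Return the prices found in one results page as integers."""
    page = BeautifulSoup(html, "html.parser")
    prices = []
    # Assumption: every price sits in an element carrying
    # data-testid="listing-price", as in the snippet from the question.
    for tag in page.select('[data-testid="listing-price"]'):
        # get_text() returns e.g. "£3,012 pcm"; keep only the digits.
        digits = re.sub(r"\D", "", tag.get_text(strip=True))
        if digits:  # skip entries that contain no digits at all
            prices.append(int(digits))
    return prices


def scrape_all(fetch_html, max_pages=200):
    """Collect prices page by page until a page yields none.

    fetch_html is a hypothetical callable that takes a page number and
    returns that page's HTML, for example:
        lambda n: requests.get(url + str(n)).text
    """
    all_prices = []
    for page_number in range(1, max_pages + 1):
        prices = extract_prices(fetch_html(page_number))
        if not prices:
            break  # assumed to mean we ran past the last page
        all_prices.extend(prices)
    return all_prices
```

Passing the fetching step in as a function keeps the parsing testable without hitting the website, and makes it easy to add a delay or headers between requests later.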