How to get all page results - Web Scraping - Pagination

Time:06-10

I am a beginner when it comes to coding. Right now I am trying to get a grip on simple web scrapers using Python.

I want to scrape a real estate website and get the title, price, sqm, and so on into a CSV file.

My questions:

  1. It seems to work for the first page of results, but then it repeats: instead of running through the 40 pages, it fills the file with the same results over and over.

  2. The listings have info about the "square meter" and the "number of rooms". When I inspect the page, though, it seems that both elements use the same class. How would I extract the number of rooms, for example?

Here is the code that I have gathered so far:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'}
    url = f'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page={1}'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

def transform(soup):
    divs = soup.find_all('div', class_ = 'col-xs-12 place-over-understitial sel-bg-gray-lighter')
    for item in divs:
        title = item.find('div', {'class': 'text-225'}).text.strip().replace('\n', '')
        title2 = title.replace('\t', '')
        hausart = item.find('span', class_ = 'text-100').text.strip().replace('\n', '')
        hausart2 = hausart.replace('\t', '')
        try:
            price = item.find('span', class_ = 'text-250 text-strong text-nowrap').text.strip()
        except AttributeError:
            price = 'Auf Anfrage'
        wohnflaeche = item.find('p', class_ = 'text-250 text-strong text-nowrap').text.strip().replace('m²', '')

        angebot = {
            'title': title2,
            'hausart': hausart2,
            'price': price
        } 
        hauslist.append(angebot)
    return

hauslist=[]

for i in range(0, 40):
    print(f'Getting page {i}...')
    c = extract(i)
    transform(c)

df = pd.DataFrame(hauslist)
print(df.head())
df.to_csv('immonetHamburg.csv')

This is my first post on stackoverflow so please be kind if I should have posted my problem differently.

Thanks Pat

CodePudding user response:

You have a simple mistake.

In the URL you have to use {page} instead of {1}. That's all.

url = f'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page={page}'

I see another problem:

You start scraping at page 0, but servers often give the same results for page 0 and page 1.
You should use range(1, ...) instead of range(0, ...).
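
A minimal sketch of both fixes together. The query string is shortened here for readability; the real URL keeps all the other parameters:

```python
# shortened, illustrative version of the immonet.de search URL
BASE = 'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&page={page}'

def build_url(page):
    # interpolate the loop variable - the original code had {1} here,
    # so every single request fetched page 1
    return BASE.format(page=page)

# 1-based paging: range(1, 41) yields pages 1..40
urls = [build_url(p) for p in range(1, 41)]
print(urls[0])   # ...&page=1
print(urls[-1])  # ...&page=40
```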


As for searching for elements:

BeautifulSoup can search not only by class but also by id and any other tag attribute - e.g. name, style, data attributes, etc. It can also search by text, such as "number of rooms", and it can use a regex for this. You can also pass your own function, which checks an element and returns True/False to decide whether to keep it in the results.
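
For example, on a made-up snippet of HTML (the real immonet.de markup will differ):

```python
import re
from bs4 import BeautifulSoup

# illustrative markup only - assumed for this example
html = '''
<div id="selPrice_123"><span>350.000 €</span></div>
<p class="text-250">120 m²</p>
<p class="text-250">4 Zimmer</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# match by a function applied to the id attribute
price_div = soup.find('div', id=lambda v: v and v.startswith('selPrice'))
print(price_div.span.text)  # 350.000 €

# match by a regex applied to the tag's text
rooms_p = soup.find('p', string=re.compile('Zimmer'))
print(rooms_p.text)  # 4 Zimmer
```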

You can also chain .find() with another .find() or .find_all().

price = item.find('div', {'id': lambda value: value and value.startswith('selPrice')})
if price:
    print('price:', price.find('span').text)

And if you know that "square meter" comes before "number of rooms", then you can use find_all() to get both of them, and later use [0] for the first and [1] for the second.
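
A small sketch of that approach, on assumed markup where both values share the class from the question:

```python
from bs4 import BeautifulSoup

# assumed markup - both values use the same class, as in the question
html = ('<div><p class="text-250">120 m²</p>'
        '<p class="text-250">4 Zimmer</p></div>')
item = BeautifulSoup(html, 'html.parser')

values = item.find_all('p', class_='text-250')
sqm = values[0].text.replace('m²', '').strip()        # first match
rooms = values[1].text.replace('Zimmer', '').strip()  # second match
print(sqm, rooms)  # 120 4
```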

You should read all the documentation because it can be very useful.

CodePudding user response:

I advise you to use Selenium instead, because you can physically click the 'next page' button until you have covered all pages, and the whole code will only take a few lines.

CodePudding user response:

As @furas mentioned, you have a mistake with the page number.
To get the rooms you need find_all() and the last index, [-1], because sometimes there are 3 matching items and sometimes only 2.

# remove all \n and \t characters
translator = str.maketrans({chr(10): '', chr(9): ''})
rooms = item.find_all('p', {'class': 'text-250'})
if rooms:
    rooms = rooms[-1].text.translate(translator).strip()
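
A self-contained version of that snippet, on assumed markup (the real page will differ):

```python
from bs4 import BeautifulSoup

# assumed markup with stray whitespace, as listing pages often have
html = ('<div><p class="text-250">120 m²</p>'
        '<p class="text-250">\n\t4 Zimmer\n</p></div>')
item = BeautifulSoup(html, 'html.parser')

# remove all \n (chr 10) and \t (chr 9) characters
translator = str.maketrans({chr(10): '', chr(9): ''})
rooms = item.find_all('p', {'class': 'text-250'})
if rooms:
    # the room count is the last of the matching elements
    rooms = rooms[-1].text.translate(translator).strip()
print(rooms)  # 4 Zimmer
```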