How to scrape website-elements that are using same tags and classes?

Time:09-23

Scraping website elements that share the same tags and classes, and a loop that only produces one result, using Python's BeautifulSoup

I have been looking everywhere for a solution and have asked a software engineer and my Python professor about this, but neither could help. This is my first post, so please bear with me:

I am trying to scrape a website using BeautifulSoup to select the elements I want to extract. The main challenge is that the website uses the same tags and classes on multiple occasions for different elements (in this example: tag = span, class = text-nowrap). As a result, the calls that are supposed to pick out different elements all return the same thing: the first HTML element that has tag = span and class = text-nowrap.

I thought I might need an additional parameter in list.find to help it distinguish between them, since some elements contain distinctive content such as a date format, m², DKK/m² and so forth. Or are there other ways?

Furthermore, I seem to have a problem with my loop. Printing the result of soup.find_all for the class "container mb-5" gives me all the information I want (unfiltered). However, as soon as I add a loop that extracts the individual elements, I only get one example (from the top) instead of multiple lines of data. The syntax seems correct.

Below is my Python-code:

pip install beautifulsoup4
pip install lxml
pip install requests

from bs4 import BeautifulSoup
from lxml import etree
import requests

from csv import writer

url = "https://www.boliga.dk/salg/resultater?searchTab=1&sort=date-d&saleType=1&propertyType=1,3&salesDateMin=2015"
page = requests.get(url)
print(page)

soup = BeautifulSoup(page.content, "html.parser")
lists = soup.find_all("div", class_ = "container mb-5")


for list in lists:
  resitype = list.find('span', class_ = "text").text
  address = list.find('a', class_ = "text-primary font-weight-bolder text-left").text
  price = list.find('span', class_ = "text-nowrap").text.replace('\xa0kr.', "")
  salesdate = list.find('span', class_ = "text-nowrap")
  rooms = list.find('td', class_ = "table-col d-print-table-cell text-center")
  sqm = list.find('span', class_ = "text-nowrap")
  sqmprice = list.find('span', class_ = "text-nowrap mt-1").text
  salestype = list.find('span', class_ = "text-nowrap mt-1").text
  buildingyear = list.find('span __ngcontent-boliga-app-c182', class_ = "table-col d-print-table-cell text-center")
  procent = list.find('a', class_ = "price-reduced ng-star-inserted")

info = [resitype, address, price, salesdate, rooms, sqm, sqmprice, salestype, buildingyear, procent]

print(info)

CodePudding user response:

To get the relevant data into a pandas DataFrame you can do:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.boliga.dk/salg/resultater?searchTab=1&sort=date-d&saleType=1&propertyType=1,3&salesDateMin=2015"
soup = BeautifulSoup(requests.get(url).content, "html.parser")


all_data = []
for row in soup.select("tr"):
    tds = row.select("td")

    resitype = tds[0].p.get_text(strip=True)
    address = tds[0].a.get_text(strip=True, separator=",")
    price = tds[1].get_text(strip=True)
    salesdate = tds[2].get_text(strip=True, separator=" ")
    sqm = tds[3].get_text(strip=True, separator=",").split(",")[0]
    sqmprice = tds[3].get_text(strip=True, separator=",").split(",")[-1]
    rooms = tds[4].get_text(strip=True)
    buildingyear = tds[5].get_text(strip=True)
    procent = tds[6].get_text(strip=True)

    all_data.append(
        {
            "resitype": resitype,
            "address": address,
            "price": price,
            "salesdate": salesdate,
            "sqm": sqm,
            "sqmprice": sqmprice,
            "rooms": rooms,
            "buildingyear": buildingyear,
            "procent": procent,
        }
    )

df = pd.DataFrame(all_data)
print(df)

Prints:

         resitype                                           address           price             salesdate     sqm       sqmprice rooms buildingyear procent
0   Ejerlejlighed                      Storegade 31A, st,4780 Stege   2.200.000 kr.  14-09-2022 Alm. Salg   95 m²   23.158 kr/m²     3         1890        
1   Ejerlejlighed                       Storegade 31A, 1,4780 Stege   2.200.000 kr.  14-09-2022 Alm. Salg   99 m²   22.222 kr/m²     3         1890        
2   Ejerlejlighed                          Storegade 31B,4780 Stege   2.200.000 kr.  14-09-2022 Alm. Salg   48 m²   45.833 kr/m²     2         1890        
3           Villa                         Hybenvænget 6,5800 Nyborg   1.795.000 kr.  13-09-2022 Alm. Salg   96 m²   18.698 kr/m²     4         1971        
4           Villa                     Grønnevang 306,3250 Gilleleje   2.595.000 kr.  12-09-2022 Alm. Salg  105 m²   24.714 kr/m²     4         1987        
5           Villa                          Teglvej 10,9800 Hjørring     860.000 kr.  08-09-2022 Alm. Salg  139 m²    6.187 kr/m²     4         1954        
6   Ejerlejlighed            Benløseparken 177, 2. tv,4100 Ringsted   1.500.000 kr.  07-09-2022 Alm. Salg  103 m²   14.563 kr/m²     4         1980     -3%
7   Ejerlejlighed                Lundevej 36, st. 4,4400 Kalundborg   1.450.000 kr.  07-09-2022 Alm. Salg  134 m²   10.821 kr/m²     2         1957     -3%
8   Ejerlejlighed          Lundtoftegade 93, 4. th,2200 København N   3.970.000 kr.  07-09-2022 Alm. Salg   70 m²   56.714 kr/m²     3         1930     -1%
9   Ejerlejlighed           Øresundsvej 112, 3. th,2300 København S   2.650.000 kr.  07-09-2022 Alm. Salg   62 m²   42.742 kr/m²     2         1932     -5%
10  Ejerlejlighed             Åboulevard 60, 1. tv,2200 København N   4.200.000 kr.  07-09-2022 Alm. Salg  104 m²   40.385 kr/m²     3         1903     -7%
11  Ejerlejlighed             Ålandsgade 18, 3. tv,2300 København S   2.025.000 kr.  07-09-2022 Alm. Salg   41 m²   49.390 kr/m²     1         1940     -8%
12          Villa                       Tømmerupvej 209,2791 Dragør   3.445.000 kr.  07-09-2022 Alm. Salg  201 m²   17.139 kr/m²     7         1928        
13          Villa                   Lillebjergvej 35,3390 Hundested   3.795.000 kr.  07-09-2022 Alm. Salg  182 m²   20.852 kr/m²     6         1979        
14          Villa                             Ageren 19,4652 Hårlev   2.760.000 kr.  07-09-2022 Alm. Salg  174 m²   15.862 kr/m²     5         1973     -3%
15          Villa                      Holbækvej 44,4400 Kalundborg   2.095.000 kr.  07-09-2022 Alm. Salg  176 m²   11.903 kr/m²     5         1921        
16          Villa                            Lokesvej 5,4220 Korsør   1.100.000 kr.  07-09-2022 Alm. Salg  222 m²    4.955 kr/m²     8         1971        
17          Villa                         Britaniavej 3,8500 Grenaa     486.500 kr.  07-09-2022 Alm. Salg  134 m²    3.631 kr/m²     4         1978        
18          Villa                           Assensvej 15,5853 Ørbæk     750.000 kr.  07-09-2022 Alm. Salg  159 m²    4.717 kr/m²     8         1930        
19          Villa             Ålsgårde Stationsvej 19,3140 Ålsgårde   1.700.000 kr.  07-09-2022 Alm. Salg  114 m²   14.912 kr/m²     5         1965        
20          Villa                        Platanvej 15,4000 Roskilde   6.300.000 kr.  07-09-2022 Alm. Salg  146 m²   43.151 kr/m²     7         1947        
21          Villa                       Saugstedvang 5,5600 Faaborg     900.000 kr.  07-09-2022 Alm. Salg  157 m²    5.732 kr/m²     3         1976        
22          Villa                      Thorsvænget 1,3000 Helsingør   3.850.000 kr.  07-09-2022 Alm. Salg  122 m²   31.557 kr/m²     5         1905        
23  Ejerlejlighed        Frederikssundsvej 408, 2. tv,2700 Brønshøj   2.045.000 kr.  06-09-2022 Alm. Salg   63 m²   32.460 kr/m²     3         1954     -2%
24  Ejerlejlighed                      Ørebakken 22B,3000 Helsingør   1.245.000 kr.  06-09-2022 Alm. Salg  217 m²    5.737 kr/m²     6         1897        
25  Ejerlejlighed  Henrik Ibsens Vej 10, 4. th,1813 Frederiksberg C   6.220.000 kr.  06-09-2022 Alm. Salg   84 m²   74.048 kr/m²     3         1899        
26  Ejerlejlighed              Messinavej 9, 1. tv,2300 København S   2.445.000 kr.  06-09-2022 Alm. Salg   55 m²   44.455 kr/m²     2         1937     -2%
27  Ejerlejlighed               Blåbærhaven 12, 1. mf,2980 Kokkedal     600.000 kr.  06-09-2022 Alm. Salg   48 m²   12.500 kr/m²     2         1973        
28  Ejerlejlighed                  Folehaven 114, st. th,2500 Valby   1.850.000 kr.  06-09-2022 Alm. Salg   58 m²   31.897 kr/m²     2         1937     -7%
29  Ejerlejlighed                     Bøgelundsvej 67B,6920 Videbæk     620.000 kr.  06-09-2022 Alm. Salg   77 m²    8.052 kr/m²     2         1978     -5%
30          Villa                        Skolestien 4,3150 Hellebæk   7.500.000 kr.  06-09-2022 Alm. Salg  218 m²   34.404 kr/m²     5         1995     -5%
31          Villa                         Møllevangen 9,8450 Hammel   2.550.000 kr.  06-09-2022 Alm. Salg  179 m²   14.246 kr/m²     5         1973        
32          Villa                     Ålholmparken 95,3400 Hillerød   2.864.500 kr.  06-09-2022 Alm. Salg  184 m²   15.568 kr/m²     7         1970        
33          Villa                         Engmarkvej 11,7620 Lemvig     250.000 kr.  06-09-2022 Alm. Salg  151 m²    1.656 kr/m²     3         1905        
34          Villa                  Dronningensgade 19,4100 Ringsted   2.750.000 kr.  06-09-2022 Alm. Salg   95 m²   28.947 kr/m²     2         1938        
35          Villa                     Ndr Dragørvej 173,2791 Dragør   2.875.000 kr.  06-09-2022 Alm. Salg   76 m²   37.829 kr/m²     4         1915        
36          Villa                               Sdr Alle 9,9760 Vrå     430.000 kr.  06-09-2022 Alm. Salg  122 m²    3.525 kr/m²     3         1927    -13%
37  Ejerlejlighed         Borups Allé 235B, 2. th,2400 København NV   2.450.000 kr.  05-09-2022 Alm. Salg   66 m²   37.121 kr/m²     2         1921     -2%
38  Ejerlejlighed     Johan Kellers Vej 49, 1. th,2450 København SV   1.285.000 kr.  05-09-2022 Alm. Salg   59 m²   21.780 kr/m²     2         1936        
39  Ejerlejlighed             Middelfartvej 54, 2. th,5200 Odense V   1.635.000 kr.  05-09-2022 Alm. Salg   94 m²   17.394 kr/m²     4         1956        
40  Ejerlejlighed                 Bagerstræde 9, 3,1617 København V   9.000.000 kr.  05-09-2022 Alm. Salg  154 m²   58.442 kr/m²     5         1908     -5%
41  Ejerlejlighed              Willemoesgade 45, 4,2100 København Ø  14.500.000 kr.  05-09-2022 Alm. Salg  188 m²   77.128 kr/m²     5         1889        
42  Ejerlejlighed          Brandholms Alle 28B, st. th,2610 Rødovre     820.000 kr.  05-09-2022 Alm. Salg   66 m²   12.424 kr/m²     3         1961        
43  Ejerlejlighed                        Roret 119,3070 Snekkersten   3.695.000 kr.  05-09-2022 Alm. Salg  112 m²   32.991 kr/m²     3         2002        
44  Ejerlejlighed               Bjergbygade 6, st. tv,4200 Slagelse  13.150.000 kr.  05-09-2022 Alm. Salg  113 m²  116.372 kr/m²     4         1960        
45  Ejerlejlighed                Bjergbygade 6, 1. th,4200 Slagelse  13.150.000 kr.  05-09-2022 Alm. Salg  113 m²  116.372 kr/m²     4         1960        
46  Ejerlejlighed                Bjergbygade 6, 3. tv,4200 Slagelse  13.150.000 kr.  05-09-2022 Alm. Salg  102 m²  128.922 kr/m²     4         1960        
47  Ejerlejlighed                Bjergbygade 6, 3. th,4200 Slagelse  13.150.000 kr.  05-09-2022 Alm. Salg   98 m²  134.184 kr/m²     4         1960        
48  Ejerlejlighed                Bjergbygade 6, 2. tv,4200 Slagelse  13.150.000 kr.  05-09-2022 Alm. Salg  125 m²  105.200 kr/m²     5         1960        
49  Ejerlejlighed                Bjergbygade 6, 1. tv,4200 Slagelse  13.150.000 kr.  05-09-2022 Alm. Salg  116 m²  113.362 kr/m²     4         1960        

CodePudding user response:

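Note: Avoid reusing Python built-in names such as list as variable names. list is not a reserved keyword, so Python allows it, but the variable then shadows the built-in, which could have unwanted effects on the results of your code.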
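As a minimal sketch of what can go wrong, the snippet below reuses the name list as a loop variable and then tries to call the built-in. The values in lists are made up for illustration and stand in for the scraped containers.

```python
# Hypothetical data standing in for the scraped containers.
lists = ["first container", "second container"]


def shadowing_demo():
    for list in lists:  # rebinds the name `list` inside this function
        pass
    try:
        # `list` now refers to the last string, not the built-in type
        return list("abc")
    except TypeError as exc:
        return f"TypeError: {exc}"


result = shadowing_demo()
print(result)
```

Renaming the loop variable (for example to row or item) avoids the problem entirely.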

  1. Select your elements more specifically - I would recommend avoiding unspecific classes and relying instead on CSS selectors, more static identifiers and the HTML structure.

  2. You could also assign values to multiple variables in one go.

  3. Put info inside your for-loop, so it is built and printed in each iteration rather than only once after the loop has finished.

Example

import requests
from bs4 import BeautifulSoup
import csv

url = "https://www.boliga.dk/salg/resultater?searchTab=1&sort=date-d&saleType=1&propertyType=1,3&salesDateMin=2015"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

with open('mycsvfile.csv', 'w', encoding='utf-8', newline="") as f:
    w = csv.writer(f)
    w.writerow(['resitype', 'address', 'price', 'salesdate', 'rooms', 'sqm', 'sqmprice', 'salestype', 'buildingyear', 'procent'])
    for row in soup.select('table tr'):
        resitype = row.select_one('app-tooltip span').text
        address = row.a.text
        price = row.select_one('td:nth-of-type(2)').text.split()[0]
        salesdate,salestype = row.select_one('td:nth-of-type(3)').stripped_strings
        rooms = row.select_one('td:nth-of-type(5)').text
        sqm,sqmprice = row.select_one('td:nth-of-type(4)').stripped_strings
        buildingyear = row.select_one('td:nth-of-type(6)').text
        procent = row.select_one('td:nth-of-type(7)').text
        info = [resitype, address, price, salesdate, rooms, sqm, sqmprice, salestype, buildingyear, procent]
        print(info)
        #write your result to csv
        w.writerow(info)

Output

resitype,address,price,salesdate,rooms,sqm,sqmprice,salestype,buildingyear,procent
Ejerlejlighed," Storegade 31A, st 4780 Stege ",2.200.000,14-09-2022, 3 ,95 m²,23.158 kr/m²,Alm. Salg, 1890 ,
Ejerlejlighed," Storegade 31A, 1 4780 Stege ",2.200.000,14-09-2022, 3 ,99 m²,22.222 kr/m²,Alm. Salg, 1890 ,
Ejerlejlighed, Storegade 31B 4780 Stege ,2.200.000,14-09-2022, 2 ,48 m²,45.833 kr/m²,Alm. Salg, 1890 ,
Villa, Hybenvænget 6 5800 Nyborg ,1.795.000,13-09-2022, 4 ,96 m²,18.698 kr/m²,Alm. Salg, 1971 ,
Villa, Grønnevang 306 3250 Gilleleje ,2.595.000,12-09-2022, 4 ,105 m²,24.714 kr/m²,Alm. Salg, 1987 ,
Villa, Teglvej 10 9800 Hjørring ,860.000,08-09-2022, 4 ,139 m²,6.187 kr/m²,Alm. Salg, 1954 ,
Ejerlejlighed," Benløseparken 177, 2. tv 4100 Ringsted ",1.500.000,07-09-2022, 4 ,103 m²,14.563 kr/m²,Alm. Salg, 1980 , -3% 
Ejerlejlighed," Lundevej 36, st. 4 4400 Kalundborg ",1.450.000,07-09-2022, 2 ,134 m²,10.821 kr/m²,Alm. Salg, 1957 , -3% 
...
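
The three recommendations can also be tried offline, without requesting the live page. The sketch below uses a small hand-written table as a stand-in; the markup is an assumption made for illustration and is not the real structure of the site. It shows that find() keeps returning the first span with the shared class, while positional CSS selectors tell the cells apart, and that building info inside the loop yields one entry per row.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: every value sits in a span with the same class.
html = """
<table>
  <tr>
    <td><span class="text-nowrap">2.200.000 kr.</span></td>
    <td><span class="text-nowrap">14-09-2022</span><span class="text-nowrap">Alm. Salg</span></td>
    <td><span class="text-nowrap">95 m²</span><span class="text-nowrap">23.158 kr/m²</span></td>
  </tr>
  <tr>
    <td><span class="text-nowrap">1.795.000 kr.</span></td>
    <td><span class="text-nowrap">13-09-2022</span><span class="text-nowrap">Alm. Salg</span></td>
    <td><span class="text-nowrap">96 m²</span><span class="text-nowrap">18.698 kr/m²</span></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for row in soup.select("table tr"):
    # find() returns only the first matching span in the row (the price here)
    first_match = row.find("span", class_="text-nowrap").text

    # Positional selectors distinguish cells that share the same class
    price = row.select_one("td:nth-of-type(1)").get_text(strip=True)
    # A cell holding two text fragments can be unpacked in one assignment
    salesdate, salestype = row.select_one("td:nth-of-type(2)").stripped_strings
    sqm, sqmprice = row.select_one("td:nth-of-type(3)").stripped_strings

    # Built inside the loop, so every row contributes its own entry
    info = [price, salesdate, salestype, sqm, sqmprice]
    rows.append(info)

print(rows)
```

If a cell on the real page holds a different number of text fragments than expected, the tuple unpacking raises a ValueError, so it is worth checking the structure of a few rows first.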