Scraping website elements that share the same tags and classes, and a loop that only runs once, using Python's BeautifulSoup
I have been looking everywhere for a solution and have talked with a software engineer and my Python professor about this, but neither could help. This is my first post, so please bear with me:
I am trying to scrape a website using BeautifulSoup to extract the elements I want. The main challenge is that the website uses the same tags and classes on multiple occasions for different elements (in this example: tag = span, class = text-nowrap).
This means that the calls meant to pick out different elements all return the same thing: the first HTML element that has tag = span and class = text-nowrap.
In this instance, I thought I might need an additional parameter in list.find to help it distinguish between them, since some elements contain distinctive content such as a date format, m², DKK/m² and so forth. Or are there other ways?
Furthermore, I seem to have a problem with iteration when creating the loop. When I print the result of soup.find_all with the class parameter (in this case, "container mb-5"), I get all the information I want (unfiltered). However, as soon as I create a loop that extracts the elements I want printed, the loop only runs once (from the top), giving me one example instead of multiple lines of data. The syntax seems correct.
Below is my Python-code:
pip install beautifulsoup4
pip install lxml
pip install requests
from bs4 import BeautifulSoup
from lxml import etree
import requests
from csv import writer
url = "https://www.boliga.dk/salg/resultater?searchTab=1&sort=date-d&saleType=1&propertyType=1,3&salesDateMin=2015"
page = requests.get(url)
print(page)
soup = BeautifulSoup(page.content, "html.parser")
lists = soup.find_all("div", class_ = "container mb-5")
for list in lists:
    resitype = list.find('span', class_ = "text").text
    address = list.find('a', class_ = "text-primary font-weight-bolder text-left").text
    price = list.find('span', class_ = "text-nowrap").text.replace('\xa0kr.', "")
    salesdate = list.find('span', class_ = "text-nowrap")
    rooms = list.find('td', class_ = "table-col d-print-table-cell text-center")
    sqm = list.find('span', class_ = "text-nowrap")
    sqmprice = list.find('span', class_ = "text-nowrap mt-1").text
    salestype = list.find('span', class_ = "text-nowrap mt-1").text
    buildingyear = list.find('span __ngcontent-boliga-app-c182', class_ = "table-col d-print-table-cell text-center")
    procent = list.find('a', class_ = "price-reduced ng-star-inserted")
info = [resitype, address, price, salesdate, rooms, sqm, sqmprice, salestype, buildingyear, procent]
print(info)
CodePudding user response:
To get the relevant data into a pandas DataFrame, you can do:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.boliga.dk/salg/resultater?searchTab=1&sort=date-d&saleType=1&propertyType=1,3&salesDateMin=2015"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
all_data = []
for row in soup.select("tr"):
    # each table row holds one sale; its cells are addressed by position
    tds = row.select("td")
    resitype = tds[0].p.get_text(strip=True)
    address = tds[0].a.get_text(strip=True, separator=",")
    price = tds[1].get_text(strip=True)
    salesdate = tds[2].get_text(strip=True, separator=" ")
    sqm = tds[3].get_text(strip=True, separator=",").split(",")[0]
    sqmprice = tds[3].get_text(strip=True, separator=",").split(",")[-1]
    rooms = tds[4].get_text(strip=True)
    buildingyear = tds[5].get_text(strip=True)
    procent = tds[6].get_text(strip=True)
    all_data.append(
        {
            "resitype": resitype,
            "address": address,
            "price": price,
            "salesdate": salesdate,
            "sqm": sqm,
            "sqmprice": sqmprice,
            "rooms": rooms,
            "buildingyear": buildingyear,
            "procent": procent,
        }
    )
df = pd.DataFrame(all_data)
print(df)
Prints:
resitype address price salesdate sqm sqmprice rooms buildingyear procent
0 Ejerlejlighed Storegade 31A, st,4780 Stege 2.200.000 kr. 14-09-2022 Alm. Salg 95 m² 23.158 kr/m² 3 1890
1 Ejerlejlighed Storegade 31A, 1,4780 Stege 2.200.000 kr. 14-09-2022 Alm. Salg 99 m² 22.222 kr/m² 3 1890
2 Ejerlejlighed Storegade 31B,4780 Stege 2.200.000 kr. 14-09-2022 Alm. Salg 48 m² 45.833 kr/m² 2 1890
3 Villa Hybenvænget 6,5800 Nyborg 1.795.000 kr. 13-09-2022 Alm. Salg 96 m² 18.698 kr/m² 4 1971
4 Villa Grønnevang 306,3250 Gilleleje 2.595.000 kr. 12-09-2022 Alm. Salg 105 m² 24.714 kr/m² 4 1987
5 Villa Teglvej 10,9800 Hjørring 860.000 kr. 08-09-2022 Alm. Salg 139 m² 6.187 kr/m² 4 1954
6 Ejerlejlighed Benløseparken 177, 2. tv,4100 Ringsted 1.500.000 kr. 07-09-2022 Alm. Salg 103 m² 14.563 kr/m² 4 1980 -3%
7 Ejerlejlighed Lundevej 36, st. 4,4400 Kalundborg 1.450.000 kr. 07-09-2022 Alm. Salg 134 m² 10.821 kr/m² 2 1957 -3%
8 Ejerlejlighed Lundtoftegade 93, 4. th,2200 København N 3.970.000 kr. 07-09-2022 Alm. Salg 70 m² 56.714 kr/m² 3 1930 -1%
9 Ejerlejlighed Øresundsvej 112, 3. th,2300 København S 2.650.000 kr. 07-09-2022 Alm. Salg 62 m² 42.742 kr/m² 2 1932 -5%
10 Ejerlejlighed Åboulevard 60, 1. tv,2200 København N 4.200.000 kr. 07-09-2022 Alm. Salg 104 m² 40.385 kr/m² 3 1903 -7%
11 Ejerlejlighed Ålandsgade 18, 3. tv,2300 København S 2.025.000 kr. 07-09-2022 Alm. Salg 41 m² 49.390 kr/m² 1 1940 -8%
12 Villa Tømmerupvej 209,2791 Dragør 3.445.000 kr. 07-09-2022 Alm. Salg 201 m² 17.139 kr/m² 7 1928
13 Villa Lillebjergvej 35,3390 Hundested 3.795.000 kr. 07-09-2022 Alm. Salg 182 m² 20.852 kr/m² 6 1979
14 Villa Ageren 19,4652 Hårlev 2.760.000 kr. 07-09-2022 Alm. Salg 174 m² 15.862 kr/m² 5 1973 -3%
15 Villa Holbækvej 44,4400 Kalundborg 2.095.000 kr. 07-09-2022 Alm. Salg 176 m² 11.903 kr/m² 5 1921
16 Villa Lokesvej 5,4220 Korsør 1.100.000 kr. 07-09-2022 Alm. Salg 222 m² 4.955 kr/m² 8 1971
17 Villa Britaniavej 3,8500 Grenaa 486.500 kr. 07-09-2022 Alm. Salg 134 m² 3.631 kr/m² 4 1978
18 Villa Assensvej 15,5853 Ørbæk 750.000 kr. 07-09-2022 Alm. Salg 159 m² 4.717 kr/m² 8 1930
19 Villa Ålsgårde Stationsvej 19,3140 Ålsgårde 1.700.000 kr. 07-09-2022 Alm. Salg 114 m² 14.912 kr/m² 5 1965
20 Villa Platanvej 15,4000 Roskilde 6.300.000 kr. 07-09-2022 Alm. Salg 146 m² 43.151 kr/m² 7 1947
21 Villa Saugstedvang 5,5600 Faaborg 900.000 kr. 07-09-2022 Alm. Salg 157 m² 5.732 kr/m² 3 1976
22 Villa Thorsvænget 1,3000 Helsingør 3.850.000 kr. 07-09-2022 Alm. Salg 122 m² 31.557 kr/m² 5 1905
23 Ejerlejlighed Frederikssundsvej 408, 2. tv,2700 Brønshøj 2.045.000 kr. 06-09-2022 Alm. Salg 63 m² 32.460 kr/m² 3 1954 -2%
24 Ejerlejlighed Ørebakken 22B,3000 Helsingør 1.245.000 kr. 06-09-2022 Alm. Salg 217 m² 5.737 kr/m² 6 1897
25 Ejerlejlighed Henrik Ibsens Vej 10, 4. th,1813 Frederiksberg C 6.220.000 kr. 06-09-2022 Alm. Salg 84 m² 74.048 kr/m² 3 1899
26 Ejerlejlighed Messinavej 9, 1. tv,2300 København S 2.445.000 kr. 06-09-2022 Alm. Salg 55 m² 44.455 kr/m² 2 1937 -2%
27 Ejerlejlighed Blåbærhaven 12, 1. mf,2980 Kokkedal 600.000 kr. 06-09-2022 Alm. Salg 48 m² 12.500 kr/m² 2 1973
28 Ejerlejlighed Folehaven 114, st. th,2500 Valby 1.850.000 kr. 06-09-2022 Alm. Salg 58 m² 31.897 kr/m² 2 1937 -7%
29 Ejerlejlighed Bøgelundsvej 67B,6920 Videbæk 620.000 kr. 06-09-2022 Alm. Salg 77 m² 8.052 kr/m² 2 1978 -5%
30 Villa Skolestien 4,3150 Hellebæk 7.500.000 kr. 06-09-2022 Alm. Salg 218 m² 34.404 kr/m² 5 1995 -5%
31 Villa Møllevangen 9,8450 Hammel 2.550.000 kr. 06-09-2022 Alm. Salg 179 m² 14.246 kr/m² 5 1973
32 Villa Ålholmparken 95,3400 Hillerød 2.864.500 kr. 06-09-2022 Alm. Salg 184 m² 15.568 kr/m² 7 1970
33 Villa Engmarkvej 11,7620 Lemvig 250.000 kr. 06-09-2022 Alm. Salg 151 m² 1.656 kr/m² 3 1905
34 Villa Dronningensgade 19,4100 Ringsted 2.750.000 kr. 06-09-2022 Alm. Salg 95 m² 28.947 kr/m² 2 1938
35 Villa Ndr Dragørvej 173,2791 Dragør 2.875.000 kr. 06-09-2022 Alm. Salg 76 m² 37.829 kr/m² 4 1915
36 Villa Sdr Alle 9,9760 Vrå 430.000 kr. 06-09-2022 Alm. Salg 122 m² 3.525 kr/m² 3 1927 -13%
37 Ejerlejlighed Borups Allé 235B, 2. th,2400 København NV 2.450.000 kr. 05-09-2022 Alm. Salg 66 m² 37.121 kr/m² 2 1921 -2%
38 Ejerlejlighed Johan Kellers Vej 49, 1. th,2450 København SV 1.285.000 kr. 05-09-2022 Alm. Salg 59 m² 21.780 kr/m² 2 1936
39 Ejerlejlighed Middelfartvej 54, 2. th,5200 Odense V 1.635.000 kr. 05-09-2022 Alm. Salg 94 m² 17.394 kr/m² 4 1956
40 Ejerlejlighed Bagerstræde 9, 3,1617 København V 9.000.000 kr. 05-09-2022 Alm. Salg 154 m² 58.442 kr/m² 5 1908 -5%
41 Ejerlejlighed Willemoesgade 45, 4,2100 København Ø 14.500.000 kr. 05-09-2022 Alm. Salg 188 m² 77.128 kr/m² 5 1889
42 Ejerlejlighed Brandholms Alle 28B, st. th,2610 Rødovre 820.000 kr. 05-09-2022 Alm. Salg 66 m² 12.424 kr/m² 3 1961
43 Ejerlejlighed Roret 119,3070 Snekkersten 3.695.000 kr. 05-09-2022 Alm. Salg 112 m² 32.991 kr/m² 3 2002
44 Ejerlejlighed Bjergbygade 6, st. tv,4200 Slagelse 13.150.000 kr. 05-09-2022 Alm. Salg 113 m² 116.372 kr/m² 4 1960
45 Ejerlejlighed Bjergbygade 6, 1. th,4200 Slagelse 13.150.000 kr. 05-09-2022 Alm. Salg 113 m² 116.372 kr/m² 4 1960
46 Ejerlejlighed Bjergbygade 6, 3. tv,4200 Slagelse 13.150.000 kr. 05-09-2022 Alm. Salg 102 m² 128.922 kr/m² 4 1960
47 Ejerlejlighed Bjergbygade 6, 3. th,4200 Slagelse 13.150.000 kr. 05-09-2022 Alm. Salg 98 m² 134.184 kr/m² 4 1960
48 Ejerlejlighed Bjergbygade 6, 2. tv,4200 Slagelse 13.150.000 kr. 05-09-2022 Alm. Salg 125 m² 105.200 kr/m² 5 1960
49 Ejerlejlighed Bjergbygade 6, 1. tv,4200 Slagelse 13.150.000 kr. 05-09-2022 Alm. Salg 116 m² 113.362 kr/m² 4 1960
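For context on why iterating over the table rows helps, here is a minimal sketch on a hand-written HTML fragment. The markup is an assumption made for illustration, not a copy of the real page. It shows that find() returns only the first matching element, so repeated calls with the same tag and class give the same value, and that a selector which matches a single wrapper element makes the loop body run once.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one wrapper div holding a table with two rows.
html = """
<div class="container mb-5">
  <table>
    <tr><td><span class="text-nowrap">2.200.000 kr.</span></td>
        <td><span class="text-nowrap">14-09-2022</span></td></tr>
    <tr><td><span class="text-nowrap">1.795.000 kr.</span></td>
        <td><span class="text-nowrap">13-09-2022</span></td></tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Only one element matches the wrapper, so a loop over this result runs once.
containers = soup.find_all("div", class_="container mb-5")
print(len(containers))

# find() returns the first match every time, regardless of the variable name.
first = containers[0].find("span", class_="text-nowrap").text
again = containers[0].find("span", class_="text-nowrap").text
print(first == again)

# Looping over the rows and reading cells by position gives one record per row.
rows = []
for row in soup.select("tr"):
    price, salesdate = (td.get_text(strip=True) for td in row.select("td"))
    rows.append((price, salesdate))
print(rows)
```

If the real page is laid out differently, the selectors would need adjusting, but the idea of looping over the repeating element (the row) rather than its wrapper should carry over.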
CodePudding user response:
Note: Avoid using Python reserved terms (keywords) or built-in names such as list as variable names; this could have unwanted effects on the results of your code.
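A small sketch of what can go wrong, using made-up function names (broken and fixed are hypothetical). Strictly speaking, list is a built-in name rather than a keyword, so Python accepts it as a variable, but the built-in is then hidden in that scope:

```python
def broken():
    list = ["a", "b"]          # rebinding hides the built-in within this scope
    try:
        return list("xyz")     # this now calls a list object, not the built-in type
    except TypeError as exc:
        return f"TypeError: {exc}"


def fixed():
    items = ["a", "b"]         # a descriptive name leaves the built-in intact
    return list("xyz")


print(broken())
print(fixed())
```

Names like row or item avoid the problem and also describe what the loop variable holds.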
Select your elements more specifically. I would recommend avoiding unspecific classes and instead relying on CSS selectors, more static identifiers and the HTML structure. You could also assign values to multiple variables in one go.
Put info into your for-loop to call it immediately in each iteration.
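To illustrate the last point without touching the network, here is a tiny sketch with made-up sample values (scraped, info_after and all_info are hypothetical names). Building info after the loop only sees the values left over from the final iteration:

```python
# Hypothetical sample values standing in for scraped rows.
scraped = [("Villa", "1.795.000"), ("Ejerlejlighed", "2.200.000")]

# Building `info` after the loop keeps only the last iteration's values.
for resitype, price in scraped:
    pass
info_after = [resitype, price]

# Building `info` inside the loop yields one entry per row.
all_info = []
for resitype, price in scraped:
    info = [resitype, price]
    all_info.append(info)

print(info_after)
print(all_info)
```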
Example
import requests
from bs4 import BeautifulSoup
import csv
url = "https://www.boliga.dk/salg/resultater?searchTab=1&sort=date-d&saleType=1&propertyType=1,3&salesDateMin=2015"
soup = BeautifulSoup(requests.get(url).content)
with open('mycsvfile.csv', 'w', encoding='utf-8', newline="") as f:
    w = csv.writer(f)
    w.writerow(['resitype', 'address', 'price', 'salesdate', 'rooms', 'sqm', 'sqmprice', 'salestype', 'buildingyear', 'procent'])
    for row in soup.select('table tr'):
        resitype = row.select_one('app-tooltip span').text
        address = row.a.text
        price = row.select_one('td:nth-of-type(2)').text.split()[0]
        salesdate,salestype = row.select_one('td:nth-of-type(3)').stripped_strings
        rooms = row.select_one('td:nth-of-type(5)').text
        sqm,sqmprice = row.select_one('td:nth-of-type(4)').stripped_strings
        buildingyear = row.select_one('td:nth-of-type(6)').text
        procent = row.select_one('td:nth-of-type(7)').text
        info = [resitype, address, price, salesdate, rooms, sqm, sqmprice, salestype, buildingyear, procent]
        print(info)
        # write your result to csv
        w.writerow(info)
Output
resitype,address,price,salesdate,rooms,sqm,sqmprice,salestype,buildingyear,procent
Ejerlejlighed," Storegade 31A, st 4780 Stege ",2.200.000,14-09-2022, 3 ,95 m²,23.158 kr/m²,Alm. Salg, 1890 ,
Ejerlejlighed," Storegade 31A, 1 4780 Stege ",2.200.000,14-09-2022, 3 ,99 m²,22.222 kr/m²,Alm. Salg, 1890 ,
Ejerlejlighed, Storegade 31B 4780 Stege ,2.200.000,14-09-2022, 2 ,48 m²,45.833 kr/m²,Alm. Salg, 1890 ,
Villa, Hybenvænget 6 5800 Nyborg ,1.795.000,13-09-2022, 4 ,96 m²,18.698 kr/m²,Alm. Salg, 1971 ,
Villa, Grønnevang 306 3250 Gilleleje ,2.595.000,12-09-2022, 4 ,105 m²,24.714 kr/m²,Alm. Salg, 1987 ,
Villa, Teglvej 10 9800 Hjørring ,860.000,08-09-2022, 4 ,139 m²,6.187 kr/m²,Alm. Salg, 1954 ,
Ejerlejlighed," Benløseparken 177, 2. tv 4100 Ringsted ",1.500.000,07-09-2022, 4 ,103 m²,14.563 kr/m²,Alm. Salg, 1980 , -3%
Ejerlejlighed," Lundevej 36, st. 4 4400 Kalundborg ",1.450.000,07-09-2022, 2 ,134 m²,10.821 kr/m²,Alm. Salg, 1957 , -3%
...
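As for the "additional parameter" asked about in the question: one option, if positional selectors are not practical, is to filter on the element's text with the string argument of find(), which accepts a compiled regular expression. The sketch below runs on a hand-written row; the markup, the helper find_span and the patterns are assumptions for illustration and would need checking against the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical row: three spans share a class, only their text differs.
html = """
<tr>
  <td><span class="text-nowrap">2.200.000 kr.</span></td>
  <td><span class="text-nowrap">14-09-2022</span></td>
  <td><span class="text-nowrap">95 m²</span></td>
</tr>
"""
row = BeautifulSoup(html, "html.parser")


def find_span(pattern):
    """Return the text of the first text-nowrap span whose text matches pattern."""
    tag = row.find("span", class_="text-nowrap", string=re.compile(pattern))
    return tag.get_text(strip=True) if tag else None


salesdate = find_span(r"^\d{2}-\d{2}-\d{4}$")  # dd-mm-yyyy
sqm = find_span(r"^\d+\s*m²$")                 # area only, not a price per m²
price = find_span(r"kr\.$")                    # ends in "kr."

print(salesdate, sqm, price)
```

This only works when the span contains a single piece of text, and it is more fragile than selecting by position, so treat it as a fallback rather than a replacement for the selectors above.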