I'm trying to scrape all the filenames and their corresponding addresses from a static webpage. The script I've written fetches them almost correctly, except that it drops a space at a certain position within some addresses. For example, among other results, the script prints the following in the console:
RZ000089 1207, 1211, 1215, 1217, 1219 & 1221Carlisle Avenue
whereas my expected output is (note the space before Carlisle Avenue):
RZ000089 1207, 1211, 1215, 1217, 1219 & 1221 Carlisle Avenue
Current approach:
import requests
from bs4 import BeautifulSoup

link = 'https://www.esquimalt.ca/business-development/development-tracker/rezoning-applications'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("table.table_two_columns > tbody"):
        # "\xa0" is a non-breaking space; normalize it to a regular space
        file = item.select_one("tr > td:has(strong:-soup-contains('File:'))").get_text(strip=True).replace("File:","").replace("\xa0"," ").strip()
        addr_list = [i.text for i in item.select("tr:nth-of-type(1) > td:nth-of-type(1) > p")]
        for addr in addr_list:
            print(file,addr)
Output I'm getting (truncated):
RZ000095 A-904 Admirals Road
RZ000089 1207, 1211, 1215, 1217, 1219 & 1221Carlisle Avenue
RZ000089 512 & 522 Fraser Street
RZ000089 1212, 1216, 1220, 1222, 1224 & 1226Lyall Street
RZ000055 1072 Colville Road
RZ000056 1076 Colville Road
Output I wish to get (note the space before Carlisle Avenue and Lyall Street):
RZ000095 A-904 Admirals Road
RZ000089 1207, 1211, 1215, 1217, 1219 & 1221 Carlisle Avenue
RZ000089 512 & 522 Fraser Street
RZ000089 1212, 1216, 1220, 1222, 1224 & 1226 Lyall Street
RZ000055 1072 Colville Road
RZ000056 1076 Colville Road
CodePudding user response:
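Instead of i.text, use i.get_text() with the separator= parameter, so that the text fragments inside each <p> are joined with a space rather than concatenated directly: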
import requests
from bs4 import BeautifulSoup

link = "https://www.esquimalt.ca/business-development/development-tracker/rezoning-applications"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("table.table_two_columns > tbody"):
        # Extract the file number from the cell labelled "File:"
        file = (
            item.select_one("tr > td:has(strong:-soup-contains('File:'))")
            .get_text(strip=True)
            .replace("File:", "")
            .replace("\xa0", " ")  # non-breaking space -> regular space
            .strip()
        )
        # separator=" " puts a space between adjacent text nodes of each <p>
        addr_list = [
            i.get_text(strip=True, separator=" ")
            for i in item.select("tr:nth-of-type(1) > td:nth-of-type(1) > p")
        ]
        for addr in addr_list:
            print(file, addr)
Prints:
RZ000095 A-904 Admirals Road
RZ000089 1207, 1211, 1215, 1217, 1219 & 1221 Carlisle Avenue
RZ000089 512 & 522 Fraser Street
RZ000089 1212, 1216, 1220, 1222, 1224 & 1226 Lyall Street
RZ000055 1072 Colville Road
RZ000056 1076 Colville Road
RZ000098 812 Craigflower Road
RZ000083 881 Craigflower Road
RZ000071 820 Dunsmuir Road
...
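To see why the separator matters, here is a minimal offline sketch that does not touch the real site. The HTML snippet is an assumption made up for illustration (the actual page markup may differ, for example it might use a nested tag instead of <br>); it only reproduces the situation where a <p> holds two text nodes with no whitespace between them. It uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two text nodes inside one <p>, split by a <br>
html = "<p>1219 &amp; 1221<br>Carlisle Avenue</p>"
p = BeautifulSoup(html, "html.parser").p

# .text concatenates the text nodes with nothing in between
glued = p.text

# get_text() strips each text node and joins them with the separator
spaced = p.get_text(strip=True, separator=" ")

print(glued)
print(spaced)
```

With this assumed snippet, the first value has the number glued to the street name, and the second has a single space between them, which mirrors the difference between the two outputs above.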