Can't let a script keep a space in a certain position within some addresses

Time:04-29

I'm trying to scrape all the filenames and their corresponding addresses from a static webpage. The script I've written fetches them almost correctly, except that it drops a space at a certain position within some addresses. To be clearer, among other results the script prints the following in the console:

RZ000089 1207, 1211, 1215, 1217, 1219 & 1221Carlisle Avenue

whereas my expected output is (note the space before Carlisle Avenue):

RZ000089 1207, 1211, 1215, 1217, 1219 & 1221 Carlisle Avenue

Current approach:

import requests
from bs4 import BeautifulSoup

link = 'https://www.esquimalt.ca/business-development/development-tracker/rezoning-applications'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
}
with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("table.table_two_columns > tbody"):
        file = item.select_one("tr > td:has(strong:-soup-contains('File:'))").get_text(strip=True).replace("File:","").replace("\xa0"," ").strip()
        addr_list = [i.text for i in item.select("tr:nth-of-type(1) > td:nth-of-type(1) > p")]
        for addr in addr_list:
            print(file,addr)

Output I'm getting (truncated):

RZ000095 A-904 Admirals Road
RZ000089 1207, 1211, 1215, 1217, 1219 & 1221Carlisle Avenue
RZ000089 512 & 522 Fraser Street
RZ000089 1212, 1216, 1220, 1222, 1224 & 1226Lyall Street
RZ000055 1072 Colville Road
RZ000056 1076 Colville Road

Output I wish to get (note the space before Carlisle Avenue and Lyall Street):

RZ000095 A-904 Admirals Road
RZ000089 1207, 1211, 1215, 1217, 1219 & 1221 Carlisle Avenue
RZ000089 512 & 522 Fraser Street
RZ000089 1212, 1216, 1220, 1222, 1224 & 1226 Lyall Street
RZ000055 1072 Colville Road
RZ000056 1076 Colville Road

Answer:

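Instead of i.text, use i.get_text() with the separator= parameter. The .text property joins all text nodes inside the tag with an empty string, so text on either side of a nested tag (for example a <br>) runs together; separator=" " puts a space between the pieces: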

import requests
from bs4 import BeautifulSoup

link = "https://www.esquimalt.ca/business-development/development-tracker/rezoning-applications"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
}
with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("table.table_two_columns > tbody"):
        file = (
            item.select_one("tr > td:has(strong:-soup-contains('File:'))")
            .get_text(strip=True)
            .replace("File:", "")
            .replace("\xa0", " ")
            .strip()
        )
        addr_list = [
            i.get_text(strip=True, separator=" ")
            for i in item.select("tr:nth-of-type(1) > td:nth-of-type(1) > p")
        ]
        for addr in addr_list:
            print(file, addr)

Prints:

RZ000095 A-904 Admirals Road
RZ000089 1207, 1211, 1215, 1217, 1219 & 1221 Carlisle Avenue
RZ000089 512 & 522 Fraser Street
RZ000089 1212, 1216, 1220, 1222, 1224 & 1226 Lyall Street
RZ000055 1072 Colville Road
RZ000056 1076 Colville Road
RZ000098 812 Craigflower Road
RZ000083 881 Craigflower Road
RZ000071 820 Dunsmuir Road

...
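
To see the difference in isolation, here is a minimal sketch on a made-up snippet. The markup is an assumption: the real page may separate the numbers from the street name with something other than a <br>, but any nested tag between two text nodes behaves the same way. It uses the built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (not copied from the real page): the street name
# follows a <br>, so there is no whitespace between the two text nodes.
html = "<p>1217, 1219<br>Carlisle Avenue</p>"
p = BeautifulSoup(html, "html.parser").p

# .text concatenates the text nodes exactly as they are.
joined = p.text

# get_text(separator=" ") joins the text nodes with a space;
# strip=True trims each piece first and drops empty ones.
spaced = p.get_text(separator=" ", strip=True)

print(repr(joined))
print(repr(spaced))
```

Keep in mind that the separator is inserted between every pair of text nodes, so a tag in the middle of a word (say A-<b>904</b>) would also pick up a space. If that ever matters, handling the specific tag explicitly is the safer route.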