How to stop trailing zeros splitting across cell with Beautiful Soup?

Time:04-12

I'm building a web scraper. The top line of the scraped data splits the title across cells because of the number "1,000" at the end. How do I stop this from happening?

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.topcashback.co.uk/easyjet-holidays/'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

containers = page_soup.findAll("tr")[1:]
filename = "topcashbackEasyJetholidays.csv"
f = open(filename,"w")
headers = "title, rate \n"
f.write(headers)

for container in containers:
    title = container.td.div.span.text
    rate = container.find("span",{"class":"cashback-desc"}).text
    print("title: " + title)
    print("rate: " + rate)
    f.write(title + "," + rate + "," + "\n")

f.close()

CodePudding user response:

The easy and ugly way - wrap the title in quotes so the comma in 1,000 won't be treated as a separator in the CSV.

f.write('"' + title + '",' + rate + "," + "\n")  # btw. why the last comma?
# or with an f-string
f.write(f'"{title}",{rate}\n')

The fancier way - use csv.writer from the standard library.

CodePudding user response:

I would check out pandas.read_html before trying to reinvent the wheel:

import pandas as pd

my_url = 'https://www.topcashback.co.uk/easyjet-holidays/'
tables = pd.read_html(my_url, encoding='utf-8')
df = tables[0]
df.columns = ['title', 'n/a', 'rate']
df = df[['title', 'rate']]
df.to_csv("topcashbackEasyJetholidays.csv", index=False)

print(df)

Output:

                                   title    rate
0  London Gatwick Departures over £1,000  £50.00
1        Holiday Bookings £1000 and Over  £40.00
2        Holiday Bookings £999 and Under  £25.00

CSV:

title,rate
"London Gatwick Departures over £1,000",£50.00
Holiday Bookings £1000 and Over,£40.00
Holiday Bookings £999 and Under,£25.00

You'll also need to have lxml installed (pip install lxml).

CodePudding user response:

Here's the "fancy way", which I think is clearly the better way to go. I find it to actually be an easier and simpler way to code up the problem:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv

my_url = 'https://www.topcashback.co.uk/easyjet-holidays/'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

containers = page_soup.findAll("tr")[1:]
filename = "topcashbackEasyJetholidays.csv"
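# newline="" lets the csv module control line endings (avoids blank rows on Windows)
with open(filename, "w", newline="", encoding="utf-8") as f: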
    writer = csv.writer(f)
    writer.writerow(["title", "rate"])
    for container in containers:
        title = container.td.div.span.text
        rate = container.find("span",{"class":"cashback-desc"}).text
        print("title: " + title)
        print("rate: " + rate)
        writer.writerow([title, rate])

There are other advantages to using a CSV writer. The code is more readable and the details of the CSV file format are hidden. There are other characters that could cause you problems and the CSV writer will transparently deal with all of them. The CSV writer will only use quotes when it has to, making your CSV file smaller. If you support multiple output formats, the same code can be used to write all of them by just creating different kinds of writers at the start of the writing code.
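
To make those points concrete, here is a minimal sketch that writes to an in-memory buffer instead of a file, so it runs without touching the network. The `render` helper and the second row (a title containing double quotes) are made up for illustration and do not come from the scraped page.

```python
import csv
import io

# Sample rows. The second title is a hypothetical case with embedded quotes.
rows = [
    ["London Gatwick Departures over £1,000", "£50.00"],
    ['Holiday "Extras" Bookings', "£40.00"],
    ["Holiday Bookings £999 and Under", "£25.00"],
]


def render(rows, **fmt):
    """Hypothetical helper: write rows to a string using csv.writer."""
    buf = io.StringIO()
    writer = csv.writer(buf, **fmt)  # fmt is passed through, e.g. delimiter="\t"
    writer.writerow(["title", "rate"])
    writer.writerows(rows)
    return buf.getvalue()


# Default (comma-separated) output. Only fields that need quoting get quotes:
#   title,rate
#   "London Gatwick Departures over £1,000",£50.00
#   "Holiday ""Extras"" Bookings",£40.00
#   Holiday Bookings £999 and Under,£25.00
comma_text = render(rows)

# Same code, tab-separated output. The comma in "1,000" no longer needs quoting.
tab_text = render(rows, delimiter="\t")

# Reading the text back recovers the original fields unchanged.
round_trip = list(csv.reader(io.StringIO(comma_text)))
```

The writer doubles the embedded quotes and wraps that field for you, which the manual quoting approach would get wrong. Switching the output format is a matter of changing the arguments passed to csv.writer.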
