Trying to web scrape text from a table on a website

Time:10-26

I am a novice at this. I've been trying to scrape data from a website (https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA), but I keep coming up empty. I've tried BeautifulSoup and Scrapy, but I can't get the text out.

Eventually I want to get the row for each individual wine in the table (from all pages) into a dataframe/CSV, but currently I can't even get the first wine producer name.

If you inspect the webpage, all the details are in tags with no id or class.

My BeautifulSoup attempt

import requests
from bs4 import BeautifulSoup

URL = 'https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) \
Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.52"}

page = requests.get(URL, headers=headers)

soup = BeautifulSoup(page.content, "html.parser")
soup2 = soup.prettify()

producer = soup2.find_all('td').get_text()

print(producer)

Which is throwing the error:

producer = soup2.find_all('td').get_text()
AttributeError: 'str' object has no attribute 'find_all'

My Scrapy attempt

import pandas as pd
import scrapy
from scrapy import Selector

winedf = pd.DataFrame()

class WineSpider(scrapy.Spider):
    name = 'wine_spider'

    def start_requests(self):
        dwwa_url = "https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA"
        yield scrapy.Request(url=dwwa_url, callback=self.parse_front)

    def parse_front(self, response):
        table = response.xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table')
        page_links = table.xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/div[2]/div[1]/ul/li[3]/a(@class,\
        "dwwa-page-link") @href')
        links_to_follow = page_links.extract()
        for url in links_to_follow:
            yield response.follow(url=url, callback=self.parse_pages)

    def parse_pages(self, response):
        wine_name = Selector(response=response).xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table/tbody/\
        tr[1]/td[1]/text()').get()
        wine_name_ext = wine_name.extract().strip()
        winedf.append(wine_name_ext)
        medal = Selector(response=response).xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table/tbody/tr[1]/\
        td[4]/text()').get()
        medal_ext = medal.extract().strip()
        winedf.append(medal_ext)

Which produces an empty df.

Any help would be greatly appreciated.

Thank you!

CodePudding user response:

Try:

import pandas as pd

url = "https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA"
df = pd.read_json(url)

# print last items in df:
print(df.tail().to_markdown())

Prints:

|       | producer            | name                          | id     | competition | award | score | country | region         | subRegion           | vintage | color | style                                                    | priceBandLetter | competitionYear | competitionType |
|------:|:--------------------|:------------------------------|-------:|:------------|------:|------:|:--------|:---------------|:--------------------|--------:|:------|:---------------------------------------------------------|:----------------|----------------:|:----------------|
| 14853 | Telavi Wine Cellar  | Marani                        | 718257 | DWWA 2022   | 7     | 86    | Georgia | Kakheti        | Kindzmarauli        | 2021    | Red   | Still - Medium (between 19 and 44 g/L residual sugar)    | B               | 2022            | DWWA            |
| 14854 | Štrigova            | Muškat Žuti                   | 716526 | DWWA 2022   | 7     | 87    | Croatia | Continental    | Zagorje - Međimurje | 2021    | White | Still - Medium (between 19 and 44 g/L residual sugar)    | C               | 2022            | DWWA            |
| 14855 | Kopjar              | Muscat žUti                   | 717754 | DWWA 2022   | 7     | 86    | Croatia | Continental    | Zagorje - Međimurje | 2021    | White | Still - Medium (between 19 and 44 g/L residual sugar)    | C               | 2022            | DWWA            |
| 14856 | Cleebronn-Güglingen | Blanc De Noir Fein & Fruchtig | 719836 | DWWA 2022   | 7     | 87    | Germany | Württemberg    | Not Applicable      | 2021    | White | Still - Medium (between 19 and 44 g/L residual sugar)    | B               | 2022            | DWWA            |
| 14857 | Winnice Czajkowski  | Thoma 8 Grand Selection       | 719891 | DWWA 2022   | 6     | 90    | Poland  | Not Applicable | Not Applicable      | 2021    | White | Still - Medium (between 19 and 44 g/L residual sugar)    | D               | 2022            | DWWA            |
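
Since the goal in the question is a dataframe/CSV, here is a minimal sketch of that last step. It assumes the endpoint keeps returning a JSON array with the column names shown in the output above. The helper name `wines_to_csv` and the output file name are made up for illustration, and the two sample rows (values copied from the output above) stand in for the real `pd.read_json(API_URL)` call so the snippet runs without network access.

```python
import os
import tempfile

import pandas as pd

API_URL = "https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA"


def wines_to_csv(df, path, columns=None):
    """Write the selected columns of df to a CSV file and return that subset.

    Hypothetical helper, not part of pandas or of the site's API.
    """
    subset = df if columns is None else df[list(columns)]
    # index=False drops the 0..N row counter, which carries no information here
    subset.to_csv(path, index=False, encoding="utf-8")
    return subset


# Stand-in for: df = pd.read_json(API_URL)
sample = pd.DataFrame([
    {"id": 718257, "score": 86, "country": "Georgia", "vintage": 2021},
    {"id": 719891, "score": 90, "country": "Poland", "vintage": 2021},
])

out_path = os.path.join(tempfile.gettempdir(), "dwwa_2022_sample.csv")
subset = wines_to_csv(sample, out_path, columns=["id", "score", "country"])
```

With the real data, the same call would look like `wines_to_csv(pd.read_json(API_URL), "dwwa_2022.csv")`, which writes every column.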

CodePudding user response:

When you load a site you want to scrape, always inspect what it loads with the browser's network monitor. In this case, you can see that the page loads its data dynamically from an API. This means that you can skip scraping altogether and load the data directly from the API into pandas:

import pandas as pd

df = pd.read_json('https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA')

Which gives all 14858 items:

|    | producer              | name                      | id     | competition | award | score | country     | region      | subRegion      | vintage | color | style                                        | priceBandLetter | competitionYear | competitionType |
|---:|:----------------------|:--------------------------|-------:|:------------|------:|------:|:------------|:------------|:---------------|--------:|:------|:---------------------------------------------|:----------------|----------------:|:----------------|
| 0  | Yealands Estate Wines | Babydoll Sauvignon Blanc  | 706484 | DWWA 2022   | 7     | 88    | New Zealand | Marlborough | Not Applicable | 2021    | White | Still - Dry (below 5 g/L residual sugar)     | A               | 2022            | DWWA            |
| 1  | Yealands Estate Wines | Reserve Pinot Gris        | 706478 | DWWA 2022   | 7     | 86    | New Zealand | Marlborough | Not Applicable | 2021    | White | Still - Dry (below 5 g/L residual sugar)     | B               | 2022            | DWWA            |
| 2  | Yealands Estate Wines | Babydoll Pinot Gris       | 706479 | DWWA 2022   | 7     | 87    | New Zealand | Marlborough | Not Applicable | 2021    | White | Still - Dry (below 5 g/L residual sugar)     | A               | 2022            | DWWA            |
| 3  | Yealands Estate Wines | Reserve Chardonnay        | 705165 | DWWA 2022   | 6     | 90    | New Zealand | Hawke's Bay | Not Applicable | 2021    | White | Still - Dry (below 5 g/L residual sugar)     | B               | 2022            | DWWA            |
| 4  | Yealands Estate Wines | Reserve Sauvignon Blanc   | 706486 | DWWA 2022   | 6     | 90    | New Zealand | Marlborough | Awatere Valley | 2021    | White | Still - Dry (below 5 g/L residual sugar)     | B               | 2022            | DWWA            |
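
As for the AttributeError in the question: `soup.prettify()` returns a plain `str` meant for display, so `find_all` has to be called on `soup` itself. `find_all` in turn returns a list of tags, so `get_text()` is called on each element rather than on the list. Below is a minimal sketch on a hard-coded HTML snippet. The markup is made up for illustration, and because the real page fills its table with JavaScript, the HTML fetched with `requests` most likely contains no such `<td>` elements at all, which is why the API route is the practical one.

```python
from bs4 import BeautifulSoup

# Illustrative markup only; not the actual structure of the Decanter page
html = """
<table>
  <tr><td>Yealands Estate Wines</td><td>New Zealand</td></tr>
  <tr><td>Example Winery</td><td>Example Country</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# prettify() returns a string, so it cannot be searched like a soup object
pretty = soup.prettify()
print(type(pretty).__name__)  # str

# Search the soup object, then extract the text of each matching tag
cells = [td.get_text(strip=True) for td in soup.find_all("td")]
print(cells)
```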