Trying to web scrape text from a table on a website

I am a novice at this. I've been trying to scrape data from a website (https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA), but I keep coming up empty. I've tried both BeautifulSoup and Scrapy and can't get the text out with either.

Eventually I want to get the row for each individual wine in the table (from all pages) into a dataframe/csv, but currently I can't even get the first wine producer's name.

If you inspect the webpage, all the details are in <td> tags with no id or class.

My BeautifulSoup attempt

URL = 'https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) \
Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.52"}

page = requests.get(URL, headers=headers)

soup = BeautifulSoup(page.content, "html.parser")
soup2 = soup.prettify()

producer = soup2.find_all('td').get_text()

print(producer)

Which is throwing the error:

producer = soup2.find_all('td').get_text()
AttributeError: 'str' object has no attribute 'find_all'

My Scrapy attempt

winedf = pd.DataFrame()

class WineSpider(scrapy.Spider):
    name = 'wine_spider'

    def start_requests(self):
        dwwa_url = "https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA"
        yield scrapy.Request(url=dwwa_url, callback=self.parse_front)

    def parse_front(self, response):
        table = response.xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table')
        page_links = table.xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/div[2]/div[1]/ul/li[3]/a(@class,\
        "dwwa-page-link") @href')
        links_to_follow = page_links.extract()
        for url in links_to_follow:
            yield response.follow(url=url, callback=self.parse_pages)

    def parse_pages(self, response):
        wine_name = Selector(response=response).xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table/tbody/\
        tr[1]/td[1]/text()').get()
        wine_name_ext = wine_name.extract().strip()
        winedf.append(wine_name_ext)
        medal = Selector(response=response).xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table/tbody/tr[1]/\
        td[4]/text()').get()
        medal_ext = medal.extract().strip()
        winedf.append(medal_ext)

Which produces an empty df.

Any help would be greatly appreciated.

Thank you!

CodePudding user response：

Try:

import pandas as pd

url = "https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA"
df = pd.read_json(url)

# print the last rows of df (to_markdown() needs the optional "tabulate" package):
print(df.tail().to_markdown())

Prints:

	producer	name	id	competition	award	score	country	region	subRegion	vintage	color	style	priceBandLetter	competitionYear	competitionType
14853	Telavi Wine Cellar	Marani	718257	DWWA 2022	7	86	Georgia	Kakheti	Kindzmarauli	2021	Red	Still - Medium (between 19 and 44 g/L residual sugar)	B	2022	DWWA
14854	Štrigova	Muškat Žuti	716526	DWWA 2022	7	87	Croatia	Continental	Zagorje - Međimurje	2021	White	Still - Medium (between 19 and 44 g/L residual sugar)	C	2022	DWWA
14855	Kopjar	Muscat žUti	717754	DWWA 2022	7	86	Croatia	Continental	Zagorje - Međimurje	2021	White	Still - Medium (between 19 and 44 g/L residual sugar)	C	2022	DWWA
14856	Cleebronn-Güglingen	Blanc De Noir Fein & Fruchtig	719836	DWWA 2022	7	87	Germany	Württemberg	Not Applicable	2021	White	Still - Medium (between 19 and 44 g/L residual sugar)	B	2022	DWWA
14857	Winnice Czajkowski	Thoma 8 Grand Selection	719891	DWWA 2022	6	90	Poland	Not Applicable	Not Applicable	2021	White	Still - Medium (between 19 and 44 g/L residual sugar)	D	2022	DWWA
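
The question ultimately asks for a CSV. Below is a minimal sketch of that last step, assuming the DataFrame has the columns shown in the output above. The helper name save_wines, the column selection, and the file name dwwa_sample.csv are illustrative choices, not part of the answer. Two rows copied from the output stand in for the full API response so the snippet runs without network access:

```python
import pandas as pd

# Two rows copied from the printed output, standing in for the full API response.
sample = pd.DataFrame(
    [
        {"producer": "Telavi Wine Cellar", "name": "Marani",
         "score": 86, "country": "Georgia"},
        {"producer": "Winnice Czajkowski", "name": "Thoma 8 Grand Selection",
         "score": 90, "country": "Poland"},
    ]
)


def save_wines(df, path, columns=("producer", "name", "score", "country")):
    """Write the chosen columns to a CSV file, highest score first (hypothetical helper)."""
    subset = df[list(columns)].sort_values("score", ascending=False)
    subset.to_csv(path, index=False)
    return subset


# With the real data this would be: save_wines(pd.read_json(url), "dwwa_2022.csv")
top = save_wines(sample, "dwwa_sample.csv")
print(top)
```

Passing index=False keeps the row numbers out of the file, so the CSV contains only the wine columns.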

CodePudding user response：

Before you scrape a site, always inspect what it loads with the network monitor in your browser's developer tools. In this case you can see that the page loads its data dynamically from an API, so the table rows are most likely not in the HTML that requests or Scrapy receives. This means you can skip scraping altogether and load the data directly from the API into pandas:

import pandas as pd

df = pd.read_json('https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA')

Which gives all 14858 items (first five rows shown):

	producer	name	id	competition	award	score	country	region	subRegion	vintage	color	style	priceBandLetter	competitionYear	competitionType
0	Yealands Estate Wines	Babydoll Sauvignon Blanc	706484	DWWA 2022	7	88	New Zealand	Marlborough	Not Applicable	2021	White	Still - Dry (below 5 g/L residual sugar)	A	2022	DWWA
1	Yealands Estate Wines	Reserve Pinot Gris	706478	DWWA 2022	7	86	New Zealand	Marlborough	Not Applicable	2021	White	Still - Dry (below 5 g/L residual sugar)	B	2022	DWWA
2	Yealands Estate Wines	Babydoll Pinot Gris	706479	DWWA 2022	7	87	New Zealand	Marlborough	Not Applicable	2021	White	Still - Dry (below 5 g/L residual sugar)	A	2022	DWWA
3	Yealands Estate Wines	Reserve Chardonnay	705165	DWWA 2022	6	90	New Zealand	Hawke's Bay	Not Applicable	2021	White	Still - Dry (below 5 g/L residual sugar)	B	2022	DWWA
4	Yealands Estate Wines	Reserve Sauvignon Blanc	706486	DWWA 2022	6	90	New Zealand	Marlborough	Awatere Valley	2021	White	Still - Dry (below 5 g/L residual sugar)	B	2022	DWWA
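
As for the AttributeError in the BeautifulSoup attempt: prettify() returns a plain string, so find_all() has to be called on the soup object itself, and because find_all() returns a list-like ResultSet, get_text() has to be called on each cell rather than on the result as a whole. The sketch below shows this on a small, hypothetical HTML snippet (the markup is made up for illustration; only the wine names come from the output above). On the real page this would probably still find nothing, because the rows are loaded from the API after the initial HTML arrives.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML standing in for a page that really contains table cells.
html = """
<table>
  <tr><td>Yealands Estate Wines</td><td>Reserve Chardonnay</td></tr>
  <tr><td>Telavi Wine Cellar</td><td>Marani</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# prettify() returns a str: fine for printing, but it has no find_all().
pretty = soup.prettify()

# Search the soup object, then call get_text() on each <td> individually.
cells = [td.get_text(strip=True) for td in soup.find_all("td")]
print(cells)
# ['Yealands Estate Wines', 'Reserve Chardonnay', 'Telavi Wine Cellar', 'Marani']
```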