I am a novice at this. I've been trying to scrape data from a website (https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA), but I keep coming up empty. I've tried BeautifulSoup and Scrapy, but I can't get the text out.
Eventually I want to get the row for each individual wine in the table into a dataframe/csv (from all pages), but currently I can't even get the first wine producer name.
If you inspect the webpage, all the details are in tags with no id or class.
My BeautifulSoup attempt:
import requests
from bs4 import BeautifulSoup

URL = 'https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) \
Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.52"}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
soup2 = soup.prettify()
producer = soup2.find_all('td').get_text()
print(producer)
Which is throwing the error:
producer = soup2.find_all('td').get_text()
AttributeError: 'str' object has no attribute 'find_all'
My Scrapy attempt:
import pandas as pd
import scrapy
from scrapy import Selector

winedf = pd.DataFrame()

class WineSpider(scrapy.Spider):
    name = 'wine_spider'

    def start_requests(self):
        dwwa_url = "https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA"
        yield scrapy.Request(url=dwwa_url, callback=self.parse_front)

    def parse_front(self, response):
        table = response.xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table')
        page_links = table.xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/div[2]/div[1]/ul/li[3]/a(@class,\
"dwwa-page-link") @href')
        links_to_follow = page_links.extract()
        for url in links_to_follow:
            yield response.follow(url=url, callback=self.parse_pages)

    def parse_pages(self, response):
        wine_name = Selector(response=response).xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table/tbody/\
tr[1]/td[1]/text()').get()
        wine_name_ext = wine_name.extract().strip()
        winedf.append(wine_name_ext)
        medal = Selector(response=response).xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table/tbody/tr[1]/\
td[4]/text()').get()
        medal_ext = medal.extract().strip()
        winedf.append(medal_ext)
Which produces an empty df.
Any help would be greatly appreciated.
Thank you!
CodePudding user response:
Try:
import pandas as pd
url = "https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA"
df = pd.read_json(url)
# print the last rows of df (to_markdown() requires the tabulate package):
print(df.tail().to_markdown())
Prints:
| | producer | name | id | competition | award | score | country | region | subRegion | vintage | color | style | priceBandLetter | competitionYear | competitionType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14853 | Telavi Wine Cellar | Marani | 718257 | DWWA 2022 | 7 | 86 | Georgia | Kakheti | Kindzmarauli | 2021 | Red | Still - Medium (between 19 and 44 g/L residual sugar) | B | 2022 | DWWA |
| 14854 | Štrigova | Muškat Žuti | 716526 | DWWA 2022 | 7 | 87 | Croatia | Continental | Zagorje - Međimurje | 2021 | White | Still - Medium (between 19 and 44 g/L residual sugar) | C | 2022 | DWWA |
| 14855 | Kopjar | Muscat žUti | 717754 | DWWA 2022 | 7 | 86 | Croatia | Continental | Zagorje - Međimurje | 2021 | White | Still - Medium (between 19 and 44 g/L residual sugar) | C | 2022 | DWWA |
| 14856 | Cleebronn-Güglingen | Blanc De Noir Fein & Fruchtig | 719836 | DWWA 2022 | 7 | 87 | Germany | Württemberg | Not Applicable | 2021 | White | Still - Medium (between 19 and 44 g/L residual sugar) | B | 2022 | DWWA |
| 14857 | Winnice Czajkowski | Thoma 8 Grand Selection | 719891 | DWWA 2022 | 6 | 90 | Poland | Not Applicable | Not Applicable | 2021 | White | Still - Medium (between 19 and 44 g/L residual sugar) | D | 2022 | DWWA |
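Since the question asks for a dataframe/csv, here is a minimal sketch of the CSV step. The helper names `load_wines` and `wines_to_csv` are made up for illustration, and the demonstration uses two rows copied from the output above so it runs without network access; only `load_wines()` would actually contact the API.

```python
import io

import pandas as pd

# URL taken from the answer above; it is only fetched if load_wines() is called.
API_URL = "https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA"


def load_wines(url=API_URL):
    """Fetch the result set from the JSON endpoint (hypothetical helper, needs network)."""
    return pd.read_json(url)


def wines_to_csv(df, target, columns=None):
    """Write df (optionally only `columns`) to a path or file-like object, without the index."""
    subset = df if columns is None else df[list(columns)]
    subset.to_csv(target, index=False)


# Offline demonstration with two rows copied from the printed output.
sample = pd.DataFrame(
    [
        {"producer": "Telavi Wine Cellar", "name": "Marani", "award": 7, "score": 86, "country": "Georgia"},
        {"producer": "Winnice Czajkowski", "name": "Thoma 8 Grand Selection", "award": 6, "score": 90, "country": "Poland"},
    ]
)

buffer = io.StringIO()
wines_to_csv(sample, buffer, columns=["producer", "name", "score"])
csv_text = buffer.getvalue()
print(csv_text)
```

With network access, something like `wines_to_csv(load_wines(), "dwwa_2022.csv")` should write the whole table; the file name is just an example. The index in the output above runs to 14857, which suggests the endpoint returns every row in one response, so no page loop seems to be needed.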
CodePudding user response:
When you want to scrape a site, always check what it loads with your browser's network monitor first. In this case you can see that the page loads its data dynamically from an API, so the table rows are most likely not in the HTML that requests or Scrapy receives. This means you can skip scraping altogether and load the data directly from the API into pandas:
import pandas as pd
df = pd.read_json('https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA')
Which gives all 14858 items (first five rows shown):
| | producer | name | id | competition | award | score | country | region | subRegion | vintage | color | style | priceBandLetter | competitionYear | competitionType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Yealands Estate Wines | Babydoll Sauvignon Blanc | 706484 | DWWA 2022 | 7 | 88 | New Zealand | Marlborough | Not Applicable | 2021 | White | Still - Dry (below 5 g/L residual sugar) | A | 2022 | DWWA |
| 1 | Yealands Estate Wines | Reserve Pinot Gris | 706478 | DWWA 2022 | 7 | 86 | New Zealand | Marlborough | Not Applicable | 2021 | White | Still - Dry (below 5 g/L residual sugar) | B | 2022 | DWWA |
| 2 | Yealands Estate Wines | Babydoll Pinot Gris | 706479 | DWWA 2022 | 7 | 87 | New Zealand | Marlborough | Not Applicable | 2021 | White | Still - Dry (below 5 g/L residual sugar) | A | 2022 | DWWA |
| 3 | Yealands Estate Wines | Reserve Chardonnay | 705165 | DWWA 2022 | 6 | 90 | New Zealand | Hawke's Bay | Not Applicable | 2021 | White | Still - Dry (below 5 g/L residual sugar) | B | 2022 | DWWA |
| 4 | Yealands Estate Wines | Reserve Sauvignon Blanc | 706486 | DWWA 2022 | 6 | 90 | New Zealand | Marlborough | Awatere Valley | 2021 | White | Still - Dry (below 5 g/L residual sugar) | B | 2022 | DWWA |
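As for the AttributeError in the question: `soup.prettify()` returns a plain string, so `find_all` has to be called on the `soup` object instead, and because `find_all` returns a list-like result, `get_text()` has to be called on each element rather than on the result as a whole. A minimal sketch, assuming a table that is actually present in the HTML; the snippet below is a made-up stand-in built from two rows shown above, not the real page markup. On the live page this would probably still find nothing, since the rows come from the API.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML standing in for a page whose table is in the markup.
html = """
<table>
  <tr><td>Yealands Estate Wines</td><td>Reserve Chardonnay</td></tr>
  <tr><td>Telavi Wine Cellar</td><td>Marani</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# prettify() returns a str: handy for printing, but it has no find_all().
pretty = soup.prettify()
print(type(pretty).__name__)

# Search the soup object itself, then call get_text() on each <td> element.
cells = [td.get_text(strip=True) for td in soup.find_all("td")]
print(cells)
```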