Scraping a table w/ BeautifulSoup

Time:07-07

I'm new to scraping, and I've been fighting with this table for hours. I'm trying to get a few pieces of information about each exhibitor at an upcoming conference. Could someone please help me?

Code:

import requests
from bs4 import BeautifulSoup as bs

profile = requests.get('https://annual.asaecenter.org/profile.cfm?profile_name=exhibitor&master_key=EF74CF1F-95BA-EC11-80F4-EC7F36E6C06A&inv_mast_key=93A17E5D-A46F-F21E-77E4-77B38A3B30EE')
soup = bs(profile.content, 'html.parser')

tds = soup.find_all("td")

print(tds)

Output:

[<td style="width: 60%;">
                        Alpharetta Convention and Visitors Bureau
                </td>, <td style="width: 40%; text-align: right;">

                                 

                                Booth 2116
                </td>, <td  colspan="2">
<div>
</div>
</td>, <td style="width: 40%;">
                Alpharetta Convention and Visitors Bureau                                                           <br/>
</td>, <td style="width: 60%;" valign="top">
</td>, <td colspan="2"><b>Sales Contact</b><br/>
                Beth Brown<br/>
                Vice President of Sales
                </td>, <td colspan="2">
<a  href="javascript:Pops('http://www.awesomealpharetta.com','website',750,650)">
<i aria-hidden="true"  title="#xlink_label#"></i><br/>Website
                </a>
</td>, <td colspan="2">
                                Description
                        </td>, <td  colspan="2" valign="top">
     Alpharetta, GA has 30 hotels w/ 3,940   guest rooms, 44,000 sq. ft. conference center, 200  restaurants, 250  shops, &amp; 40  attractions for your attendees
        </td>, <td align="left"  style="vertical-align:text-top; font-weight:bold; width: 175px">
<label for="TBE573973_4058_EC11_80F3_D9AE4409EDD7ID" id="ROW1780B7E3F-A0FD-41FD-BB77-FD8AD8F6356ELabel">Product Categories<label>
</label></label></td>, <td  colspan="1" valign="top">
<a href="/profile.cfm?profile_name=match_exhibitor&amp;answer_key=D6573973-4058-EC11-80F3-D9AE4409EDD7&amp;xtemplate">

Desired output:

name = Alpharetta Convention and Visitors Bureau
booth = 2116
url = http://www.awesomealpharetta.com
description = Alpharetta, GA has 30 hotels w/ 3,940   guest rooms, 44,000 sq. ft. conference center, 200  restaurants, 250  shops, &amp; 40  attractions for your attendees

These are the XPath locations of each desired output:

name = //*[@id="exhibitor-profile"]/tbody/tr[1]/td[1]
booth = //*[@id="exhibitor-profile"]/tbody/tr[1]/td[2]
website = //*[@id="exhibitor-profile"]/tbody/tr[3]/td/a
description = //*[@id="ROW1466DD5DF-0695-4D68-B221-4941A5171EAB"]/td

Thanks!!

CodePudding user response:

To get the desired output, you can try:

import requests
import re
from bs4 import BeautifulSoup


response = requests.get(
    "https://annual.asaecenter.org/profile.cfm?profile_name=exhibitor&master_key=EF74CF1F-95BA-EC11-80F4-EC7F36E6C06A&inv_mast_key=93A17E5D-A46F-F21E-77E4-77B38A3B30EE"
)

soup = BeautifulSoup(response.text, "html.parser")
data = soup.select_one("div.profile_contianer")
all_data = data.get_text(strip=True, separator="|").split("|")
print(all_data[0])
print(all_data[1])
print(all_data[all_data.index("Description") + 1])
print(re.search(r"\('(.*?)'", str(data)).group(1))

Prints:

Alpharetta Convention and Visitors Bureau
Booth 2116
Alpharetta, GA has 30 hotels w/ 3,940   guest rooms, 44,000 sq. ft. conference center, 200  restaurants, 250  shops, & 40  attractions for your attendees
http://www.awesomealpharetta.com

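
For context, here is a minimal sketch of why the separator trick works, run against a small inline HTML snippet modeled on the output in the question rather than the live page. The sample markup, the `div.profile` selector, and the `parse_profile` helper are all assumptions for illustration. It also strips the "Booth " prefix, since the desired output asks for the number alone.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the <td> output shown in the question.
SAMPLE_HTML = """
<div class="profile">
  <table>
    <tr><td>Alpharetta Convention and Visitors Bureau</td><td>Booth 2116</td></tr>
    <tr><td><a href="javascript:Pops('http://www.awesomealpharetta.com','website',750,650)">Website</a></td></tr>
    <tr><td>Description</td></tr>
    <tr><td>30 hotels &amp; 40 attractions</td></tr>
  </table>
</div>
"""

def parse_profile(html):
    """Hypothetical helper: pull name, booth, url and description from one profile."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one("div.profile")
    # Join every non-empty text node with "|" so each cell becomes one list item.
    parts = container.get_text(strip=True, separator="|").split("|")
    # The description is the text node right after the "Description" label.
    description = parts[parts.index("Description") + 1]
    # Keep only the digits of "Booth 2116".
    booth = re.search(r"\d+", parts[1]).group(0)
    # The link is a javascript: call; its first quoted argument is the URL.
    url = re.search(r"Pops\('(.*?)'", str(container)).group(1)
    return {"name": parts[0], "booth": booth, "url": url, "description": description}

profile = parse_profile(SAMPLE_HTML)
print(profile)
```

One caveat with this approach: it relies on the cell text never containing the separator character, and on the first two text nodes always being the name and booth, so it may need adjusting for profiles laid out differently.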
CodePudding user response:

XPath may not have worked because the page reuses the same id ("exhibitor-profile") on several tables, so an id-based lookup can land on the wrong one. That is probably why your script was failing. Here's an alternative way to get the data:

import requests
from bs4 import BeautifulSoup

page_url = "https://annual.asaecenter.org/profile.cfm?profile_name=exhibitor&master_key=EF74CF1F-95BA-EC11-80F4-EC7F36E6C06A&inv_mast_key=93A17E5D-A46F-F21E-77E4-77B38A3B30EE"
response = requests.get(page_url).text
# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(response, 'lxml')

tables = soup.find_all(id="exhibitor-profile")
tds = tables[0].find_all_next('td')
name = tds[0].text.strip()
booth = tds[1].text.strip()

url = tables[1].find_next('a').get('href').split("'")[1]
description = tables[2].find_next(id="ROW1466DD5DF-0695-4D68-B221-4941A5171EAB").find_next('td').text.strip()

print(name)
print(booth)
print(url)
print(description)

Output:

Alpharetta Convention and Visitors Bureau
Booth 2116
http://www.awesomealpharetta.com
Alpharetta, GA has 30 hotels w/ 3,940   guest rooms, 44,000 sq. ft. conference center, 200  restaurants, 250  shops, & 40  attractions for your attendees
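
To illustrate the duplicate-id point, here is a small sketch on made-up HTML (not the live page) in which two tables share one id. The markup is an assumption; it only shows how `find` and `find_all` behave when an id is reused, and how the URL can be split out of the `javascript:` link.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables reuse the same id, as the answer describes.
SAMPLE_HTML = """
<table id="exhibitor-profile">
  <tr><td>Alpharetta Convention and Visitors Bureau</td><td>Booth 2116</td></tr>
</table>
<table id="exhibitor-profile">
  <tr><td><a href="javascript:Pops('http://www.awesomealpharetta.com','website',750,650)">Website</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() stops at the first match, so the second table is invisible to it.
first_only = soup.find(id="exhibitor-profile")

# find_all() returns every element carrying the id, in document order.
tables = soup.find_all(id="exhibitor-profile")

# Search inside each table (find_all/find) rather than after it (find_all_next),
# so cells from later tables cannot leak into the result.
name, booth = [td.get_text(strip=True) for td in tables[0].find_all("td")]

# The href is a javascript: call; the URL is its first single-quoted argument.
href = tables[1].find("a")["href"]
url = href.split("'")[1]

print(len(tables), name, booth, url)
```

Searching within a table instead of with `find_all_next` is a design choice in this sketch, not something the answer above does; on the real page the indices and nesting may differ.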