I'm trying to scrape a table using BeautifulSoup, and in one of the columns there can be more than one link (href), as in the example below.
<td >
<a href="https://smallcaps.com.au/andean-mining-ipo-colombia-exploration-high-grade-copper-gold-target/" rel="noopener noreferrer" target="_blank">Article</a> /
<a href="https://www.youtube.com/watch?v=Kgew7tuLWCg" rel="noopener noreferrer" target="_blank">Video</a> /
<a href="https://andeanmining.com.au/" rel="noopener noreferrer" target="_blank">Website</a></td>
I am using the code below to locate them, however it only returns the first href and doesn't return any of the others for rows that have more than one.
from time import sleep
import numpy as np
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 3000)
from bs4 import BeautifulSoup
# Scrape the smallcaps website for IPO Information and save into dataframe
smallcaps_URL = "https://smallcaps.com.au/upcoming-ipos/"
service = Service(r"C:\Development\chromedriver_win32\chromedriver.exe")  # raw string so the backslashes aren't treated as escape sequences
chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service, options=chrome_options)
driver.get(smallcaps_URL)
sleep(3)
close_popup = driver.find_element(By.CLASS_NAME, "tve_ea_thrive_leads_form_close")
close_popup.click()
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
all_ipo_header = soup.find_all("th")
all_ipo_content = soup.find_all("td")
ipo_headers = []
ipo_contents = []
for header in all_ipo_header:
    ipo_headers.append(header.text.replace(" ", "_"))
for content in all_ipo_content:
    if content.a:
        a = content.find('a', href=True)
        ipo_contents.append(a['href'])
    else:
        ipo_contents.append(content.text)
# Prints complete scraped dataframe from SmallCaps website
df = pd.DataFrame(np.reshape(ipo_contents, (-1, 6)), columns=ipo_headers)
print(df)
# Next thing to do is scrape a few other websites for comparison and remove duplicates.
Current output
Company_name ASX_code Issue_price Raise Focus Information
0 Allup Silica (TBA) APS $0.20 $5m Silica sand https://allupsilica.com/
1 Andean Mining (14 Feb) ADM $0.20 $6m Mineral exploration https://smallcaps.com.au/andean-mining-ipo-col...
2 Catalano Seafood (24 Feb) CSF $0.20 $6m Seafood https://www.catalanos.net.au/
3 Dragonfly Biosciences (TBA) DRF $0.20 $11m Cannabidiol oil https://dragonflybiosciences.com/
4 Equity Story Group (18 Mar) EQS $0.20 $5.5m Market advice & research https://equitystory.com.au/
5 Far East Gold (TBA) FEG $0.20 $12m Mineral exploration https://smallcaps.com.au/far-east-gold-asx-ipo...
6 Killi Resources (10 Feb) KLI $0.20 $6m Gold and copper https://www.killi.com.au/
7 Lukin Resources (TBA) LKN $0.20 $7.5m Mineral exploration https://smallcaps.com.au/lukin-resources-launc...
8 Many Peaks Gold (2 Mar) MPG $0.20 $5.5m Mineral exploration https://manypeaks.com.au/
9 Norfolk Metals (14 Mar) NFL $0.20 $5.5m Gold and uranium https://norfolkmetals.com.au/
10 Omnia Metals Group (21 Feb) OM1 $0.20 $5.5m Mineral exploration https://www.omniametals.com.au/
11 Pure Resources (16 Mar) PR1 $0.20 $4.6m Mineral exploration http://www.pureresources.com.au/
12 Pinnacle Minerals (11 Mar) PIM $0.20 $5.5m Kaolin - Haloysite https://pinnacleminerals.com.au/
13 Stelar Metals (7 Mar) SLB $0.20 $7m Copper and zinc https://stelarmetals.com.au/
14 Top End Energy (21 Mar) TEE $0.20 $6.4m Oil and gas http://www.topendenergy.com.au/
15 US Student Housing REIT (TBA) USQ $1.38 $45m US student accommodation https://usq-reit.com/
Process finished with exit code 0
The expected output should have multiple links/hrefs for some rows in the 'Information' column, however it is only returning the first link/href for all of them. Could someone please guide me in the right direction?
CodePudding user response:
a = content.find('a', href=True)
This should probably be a find_all, since there can be more than one match:
a = content.find_all('a', href=True)
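As a minimal, self-contained illustration of the difference between find and find_all, using the td cell from the question (rel/target attributes omitted for brevity):

```python
from bs4 import BeautifulSoup

# The <td> cell from the question, containing three links.
html = '''<td>
<a href="https://smallcaps.com.au/andean-mining-ipo-colombia-exploration-high-grade-copper-gold-target/">Article</a> /
<a href="https://www.youtube.com/watch?v=Kgew7tuLWCg">Video</a> /
<a href="https://andeanmining.com.au/">Website</a></td>'''

cell = BeautifulSoup(html, 'html.parser').td

# find() stops at the first matching tag...
first = cell.find('a', href=True)
print(first['href'])

# ...while find_all() returns every matching tag.
links = [a['href'] for a in cell.find_all('a', href=True)]
print(len(links))  # 3
```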
CodePudding user response:
The below seems to work - it collects every href within each cell, allowing multiple hrefs where available. Note the comprehension variable is renamed so it doesn't shadow the outer content, and the else branch is kept so cells without links still contribute a value (otherwise the reshape into 6 columns would break):
for content in all_ipo_content:
    if content.a:
        all_urls = [a.get("href") for a in content.find_all('a', href=True)]
        ipo_contents.append(all_urls)
    else:
        ipo_contents.append(content.text)
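If you'd rather keep the 'Information' column as plain strings (e.g. for CSV export) instead of storing Python lists in the DataFrame cells, you can join the collected hrefs. A small sketch with placeholder example.com URLs standing in for the real links:

```python
from bs4 import BeautifulSoup

# Two sample cells: one with multiple links, one with plain text.
html = '''<td><a href="https://example.com/a">Article</a> /
<a href="https://example.com/b">Video</a></td>
<td>Seafood</td>'''

soup = BeautifulSoup(html, 'html.parser')
ipo_contents = []
for content in soup.find_all('td'):
    if content.a:
        urls = [a['href'] for a in content.find_all('a', href=True)]
        # Join multiple links into one comma-separated string per cell.
        ipo_contents.append(', '.join(urls))
    else:
        ipo_contents.append(content.text)

print(ipo_contents)
```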