Can't scrape an HTML table with Beautiful Soup

Time:03-21

I am trying to scrape the IPO table data from here: https://www.iposcoop.com/last-12-months/

Here is my code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.iposcoop.com/last-12-months/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table1 = soup.find("table",id='DataTables_Table_0')
table1_data = table1.tbody.find_all("tr")
table1

However, table1 is None (NoneType). Why is that, and is there a solution? I have read related questions, and an iframe doesn't seem to be the cause.

CodePudding user response:

You can grab the table data using pandas:

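from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.iposcoop.com/last-12-months'
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
# Select the table by class names instead of the id
table = soup.select_one('.standard-table.ipolist')
# Wrap the markup in StringIO: recent pandas versions deprecate passing a literal HTML string
table_data = pd.read_html(StringIO(str(table)))[0]
print(table_data)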

Output:

                 Company  Symbol  ...   Return SCOOP Rating
0                                         Akanda Corp.    AKAN  ...   85.00%          S/O     
1    The Marygold Companies, Inc. (aka Concierge Te...    MGLD  ...    9.50%          S/O     
2                            Blue Water Vaccines, Inc.     BWV  ...  343.33%          S/O     
3            Meihua International Medical Technologies    MHUA  ...  -33.00%          S/O     
4                                        Vivakor, Inc.    VIVK  ...  -49.40%          S/O     
..                                                 ...     ...  ...      ...          ...     
355                Khosla Ventures Acquisition Co. III    KVSC  ...   -2.80%          S/O     
356           Dragoneer Growth Opportunities Corp. III    DGNU  ...   -2.40%          S/O     
357                                        Movano Inc.    MOVE  ...  -43.60%          S/O     
358         Supernova Partners Acquisition Company III  STRE.U  ...    0.10%          S/O     
359                           Universe Pharmaceuticals     UPC  ...  -74.00%          S/O     

[360 rows x 10 columns]

CodePudding user response:

While F.Hoque's answer gives you a solution, it does not seem to answer why your code throws an error.

You are trying to find a table with the id DataTables_Table_0. If you open the page in a browser, you can see that an element with that id exists. But if you open the same page with JavaScript disabled, the table no longer has that id. The id is assigned by a JavaScript module after the page loads.
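To make the difference concrete, here is a small offline sketch. Both HTML strings are made-up stand-ins (not copied from the real site): one imitates what the server sends, the other imitates what a browser shows after a script has added the id. It uses the built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the HTML the server sends: no id on the table.
served_html = """
<table class="standard-table ipolist">
  <tbody><tr><td>Akanda Corp.</td><td>AKAN</td></tr></tbody>
</table>
"""

# Hypothetical markup imitating the DOM after a script has added the id.
rendered_html = served_html.replace(
    '<table class="standard-table ipolist"',
    '<table id="DataTables_Table_0" class="standard-table ipolist"',
)

served = BeautifulSoup(served_html, "html.parser")
rendered = BeautifulSoup(rendered_html, "html.parser")

# Looking up the id only succeeds on the "rendered" version.
id_in_served = served.find("table", id="DataTables_Table_0")
id_in_rendered = rendered.find("table", id="DataTables_Table_0")

# The class names are present in the served markup, so this lookup succeeds.
class_in_served = served.select_one(".standard-table.ipolist")

print(id_in_served is None)         # True
print(id_in_rendered is not None)   # True
print(class_in_served is not None)  # True
```

Since requests only ever receives the first kind of markup, the lookup by id returns None there.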

requests only downloads the base HTML and BeautifulSoup only parses it; neither of them runs JavaScript. So you have two options:

  1. Use a selector that exists in the base HTML (in this case .standard-table.ipolist)
  2. Use Selenium to run the JavaScript and fetch the HTML as it is seen in a browser
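As a rough sketch of the first option, the helper below looks the table up by its class names and fails with a clear message when nothing matches, instead of an AttributeError on None. The function name extract_rows and the sample markup are hypothetical, chosen only for illustration; with the real page you would pass in the text returned by requests.

```python
from bs4 import BeautifulSoup


def extract_rows(html, selector="table.standard-table.ipolist"):
    """Return the data rows of the first matching table as lists of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one(selector)
    if table is None:
        # Typical cause: the selector only exists after JavaScript has run.
        raise ValueError(f"no element matches {selector!r} in the fetched HTML")
    rows = []
    for row in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if cells:  # skip header rows, which contain only <th> cells
            rows.append(cells)
    return rows


# Hypothetical sample markup standing in for the downloaded page.
sample = """
<table class="standard-table ipolist">
  <thead><tr><th>Company</th><th>Symbol</th></tr></thead>
  <tbody>
    <tr><td>Akanda Corp.</td><td>AKAN</td></tr>
    <tr><td>Vivakor, Inc.</td><td>VIVK</td></tr>
  </tbody>
</table>
"""

rows = extract_rows(sample)
print(rows)
```

Checking for None right after the lookup makes a missing table show up as a readable error at the point where the selector fails.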