How to scrape company names from inc5000?

Time:07-06

I am trying to scrape all company names from the inc5000 site ("https://www.inc.com/inc5000/2021"). The problem is that the company names are rendered with JavaScript. I have tried both selenium and requests_html to render the site, but when I fetch the page source I still get the JavaScript instead of the company names. This is what I tried. I am new to web scraping, so I may be making a simple mistake. Please guide me.

Here is my code.

...

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
options = Options()
options.headless = True

driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.get("https://www.inc.com/inc5000/2021")
data = driver.page_source
print(data)
...

CodePudding user response:

You could give the page some time to render, or use Selenium's waits:

...
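import time

# Reuses driver and the BeautifulSoup import from the question's code
driver.get('https://www.inc.com/inc5000/2021')
time.sleep(5)  # crude fixed delay so the JavaScript can render the list
data = driver.page_source

# Name the parser explicitly; otherwise bs4 picks one and emits a warning
soup = BeautifulSoup(data, 'html.parser')

for e in soup.select('.company'):
    print(e.text)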
...
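
The fixed sleep works, but the Selenium waits mentioned above are usually more reliable. Here is a rough sketch of that variant. It assumes the names still sit in elements matching `.company` (taken from the snippet above, not re-checked against the live page); `clean_names` and `fetch_company_names` are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```python
def clean_names(raw_names):
    """Strip whitespace, drop blanks, and keep first-seen order without duplicates."""
    seen = set()
    names = []
    for raw in raw_names:
        name = raw.strip()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


def fetch_company_names(url, timeout=10):
    # Imported here so clean_names() stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Poll until at least one ".company" element exists in the DOM;
        # raises TimeoutException if none appears within `timeout` seconds.
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".company"))
        )
        return clean_names(e.text for e in elements)
    finally:
        driver.quit()


if __name__ == "__main__":
    for name in fetch_company_names("https://www.inc.com/inc5000/2021"):
        print(name)
```

Unlike `time.sleep(5)`, the wait returns as soon as the elements show up and fails loudly if they never do, so a slow page does not silently produce an empty result.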

CodePudding user response:

You don't need Beautiful Soup for this; you can use Selenium alone:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.inc.com/inc5000/2021")

companies = [e.text for e in driver.find_elements(By.CLASS_NAME, "company")]

This will only give you the elements currently in the viewport. To collect the rest, you need to scroll the page and gather the names as they are rendered.
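
One possible way to do the scrolling is sketched below. It rests on some assumptions: the page scrolls the main window (rather than an inner container), new `.company` elements appear after each scroll, and three scrolls in a row without new names means the end has been reached. `collect_while_scrolling` and `scrape_inc5000` are hypothetical names, and the one-second pause is a guess.

```python
import time


def collect_while_scrolling(read_visible, scroll_down, max_idle_rounds=3, pause=0.0):
    """Read visible names, then scroll, until several rounds in a row add nothing new."""
    names = []
    seen = set()
    idle_rounds = 0
    while idle_rounds < max_idle_rounds:
        new_found = False
        for name in read_visible():
            name = name.strip()
            if name and name not in seen:
                seen.add(name)
                names.append(name)
                new_found = True
        # Reset the counter whenever this round produced something new.
        idle_rounds = 0 if new_found else idle_rounds + 1
        scroll_down()
        time.sleep(pause)  # give the page time to render the next batch
    return names


def scrape_inc5000(url="https://www.inc.com/inc5000/2021"):
    # Imported here so the loop above can be exercised without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return collect_while_scrolling(
            read_visible=lambda: [
                e.text for e in driver.find_elements(By.CLASS_NAME, "company")
            ],
            # Assumption: scrolling the window by one viewport loads more rows.
            scroll_down=lambda: driver.execute_script(
                "window.scrollBy(0, window.innerHeight);"
            ),
            pause=1.0,
        )
    finally:
        driver.quit()


if __name__ == "__main__":
    print(scrape_inc5000())
```

Passing the "read" and "scroll" steps in as callables keeps the stopping logic separate from Selenium. If the page replaces its DOM nodes while scrolling, reading `e.text` may raise `StaleElementReferenceException`, and you would need to retry that read.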
