I am trying to collect information on startups for my research from the StartupBlink website (https://www.startupblink.com/startups). Here is my code:
import requests
import pandas as pd
import urllib.request
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from bs4 import BeautifulSoup
from time import sleep
from time import time

rows = []  # collect dicts here; DataFrame.append was removed in pandas 2.0
for p in range(1, 770):
    url = f'https://www.startupblink.com/startups?page={p}&location=united-states'
    r = requests.get(url)
    us = r.text
    soup = BeautifulSoup(us, 'html.parser')
    allbus = soup.find_all('div', class_='sc-2ozyz3-0 jlGOJO entity-card laptop:test')
    for bus in allbus:
        business_name = bus.find('a', class_='sc-2ozyz3-3 bPSWdR').text
        # look up the location/industry blocks once instead of three times
        info = bus.find_all('div', class_='sc-2ozyz3-4 iNXPUy')
        city = info[0].find('a').text
        industry = info[1].find_all('a')[0].text
        industryspec = info[1].find_all('a')[1].text
        description = bus.find('div', class_='sc-2ozyz3-9 gHVzj').text
        # rstrip() strips a set of characters, not a suffix; removesuffix() needs Python 3.9+
        description = description.removesuffix('\xa0Read more')
        rows.append({"Business_name": business_name, "City": city, "Industry": industry, 'Industry Specific': industryspec, 'Description': description})
    sleep(0.01)
    print(p)

df = pd.DataFrame(rows)
df = df.dropna()
df = df.drop_duplicates()
df.describe()
Unfortunately, I could not figure out a better approach that gets all the information I need directly from the page without the inner for loop, which searches each page several times and takes too long.
Any suggestions?
Also, I cannot work out how to get the country name from the HTML output (it is the second link in the div):
<a href="/startups/qiwi">QIWI</a>
<div ><div ></div>
<a href="/startupecosystem/moscow russia">Moscow</a>,
<a href="/startupecosystem/russia">Russia</a></div>
<div ><div ></div>
<a href="/startups/industry/fintech">
Appreciate your help and advice!
CodePudding user response:
That page is hydrated from an API, which you can see in the browser's Dev Tools under the Network tab. To get the information, query that API endpoint directly instead of parsing the HTML. Here is one way to do it:
import requests
import pandas as pd
from tqdm import tqdm

s = requests.Session()
big_df = pd.DataFrame()
for x in tqdm(range(27)):
    r = s.get(f'https://www.startupblink.com/api/entities?entity=startups&page={x}&bounds=-48.58314637707078,-177.71484375,80.2661234640419,-6.152343750000001&sortBy=rank&order=desc&leaderType=1&countryId=1')
    df = pd.json_normalize(r.json()['page'])
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
print(big_df)
Result in terminal:
id title description lat lng unicorn import_tag update_method lowtech pantheon exit slugNumber cb_logo url_rank local_rank stage featured when industry_slug industry_name industry_id subindustry_slug subindustry_id subindustry_name tags tags_name logo url crunchbase linkedin_url city city_slug country_slug country state city_id country_id state_id status highest_rank location claimed_by region_ids city_bounds country_bounds region_name region_bounds region_id cluster_parent
0 4227 DuckDuckGo DuckDuckGo is a general search engine with:\n --No tracking.\n --Better instant answers.\n --Way less spam and clutter.\n\nMore at https://duckduckgo.com/press/ 40.0025 -75.118 0 angellist angellist 0 0 0 0 None 981818.18181818176526576281 2 NaN NaN 1397184129 software-data Software & Data 10 software 80.0 Software 365 Search https://www.startupblink.com/uploads/startups_logo/3c3044925df3260f03ce454bf947349c.jpg https://duckduckgo.com/ None None Philadelphia philadelphia united-states United States PA 154 1 54.0 1 10677525 Philadelphia, United States NaN 4,43,37,15 39.8670041,-75.280303,40.1379919,-74.9557629 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 154.0
1 33033 Medium Medium is rethinking how ideas and storied are shared with the world. We believe: \n\n- Great ideas can come from anywhere\n- People create better things together\n- Design matters at a deep level\n\nWe also care deeply about how media shapes the lives of individuals and the decisions of society — and we think it can be better. \n\nWe have a world-class engineering and design team, which we are looking to grow slowly and deliberately. Let us know if you're interested. 37.7749 -122.419 0 angellist angellist 0 0 0 0 None 514218.05752427189145237207 3 NaN NaN 1397182260 software-data Software & Data 10 apps 72.0 Apps 267 Mobile https://www.startupblink.com/uploads/startups_logo/77cc196151a296effc9295ab70da4302.jpg http://medium.com/ None http://www.linkedin.com/company/medium-com San Francisco san-francisco united-states United States CA 5 1 25.0 1 10677525 San Francisco, United States NaN 4,43,37,15 37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
2 176060 Eventbrite Eventbrite brings people together around the power of live events. Founded in 2006, the innovative ticketing, registration, and event discovery platform has sold more than 140M tickets in 176 countries, and processed over $2B in gross ticket sales (25% of the in the last six months). We’re transforming the ticketing and registration industry from the ground up, and we're looking for amazing people to help us change the way people get together. 37.7749 -122.419 0 angellist angellist 0 0 0 0 None 454876.68161434977082535625 4 NaN NaN 1397189157 software-data Software & Data 10 apps 72.0 Apps 267 Mobile https://www.startupblink.com/uploads/startups_logo/1c8ed51b74f154a7eb29fdb881417fb2.jpg http://www.eventbrite.com/ None http://www.linkedin.com/company/eventbrite San Francisco san-francisco united-states United States CA 5 1 25.0 1 10677525 San Francisco, United States NaN 4,43,37,15 37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
3 282599 FTX Exchange FTX Exchange is a cryptocurrency derivatives exchange company built by traders, for traders. 37.7749 -122.419 0 massive_CB_import21_2018 any 0 0 0 0 /image/upload/v3wgeajl4zaccve2fqgh 370193.95945386844687163830 5 NaN NaN 1612865162 fintech Fintech 4 cryptocurrency 20.0 Cryptocurrency None None https://res.cloudinary.com/crunchbase-production/image/upload/vqz68owblsgchsqpyjzm https://ftx.com/ https://www.crunchbase.com/organization/ftx-exchange None San Francisco san-francisco united-states United States CA 5 1 25.0 1 10677525 San Francisco, United States NaN 4,43,37,15 37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
4 341985 JUUL JUUL is a manufacturer and distributor of electronic nicotine vaporizers. 37.7749 -122.419 0 massive_CB_2022 any 0 0 0 0 /image/upload/v1429671971/po5mfc1lakppkxasfvaz.png 343775.01932146179024130106 6 NaN 0.0 1642957296 social-leisure Social & Leisure 9 social-leisure-other 68.0 Social & Leisure-Other None None None https://www.juul.com None None San Francisco san-francisco united-states United States CA 5 1 25.0 1 10677525 San Francisco, United States NaN 4,43,37,15 37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1291 299976 AiCure AiCure is an advanced data analytics company that uses artificial intelligence to understand how patients respond to treatments. 40.7128 -74.006 0 massive_CB_2022 any 0 0 0 0 None 235.74892181180308625699 2372 NaN 0.0 1642945905 software-data Software & Data 10 data-analytics 77.0 Data Analytics None None None http://www.aicure.com None None New York new-york united-states United States NY 15 1 27.0 1 10677525 New York, United States None 4,43,37,15 40.4959961,-74.2590879,40.9152556,-73.7002721 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 15.0
1292 262861 Primary Primary is making better clothes for kids and building a better experience for busy parents to shop for them. 40.7128 -74.006 0 massive_CB_import21_2015 any 0 0 0 0 /image/upload/v1427864328/d3eplpf1udmzamqlxbok.png 235.54787246262657163243 2373 NaN NaN 1612862505 ecommerce-retail Ecommerce & Retail 1 ecommerce 2.0 Ecommerce None None https://res.cloudinary.com/crunchbase-production/image/upload/v1427864328/d3eplpf1udmzamqlxbok.png https://www.primary.com/ https://www.crunchbase.com/organization/primary None New York new-york united-states United States NY 15 1 27.0 1 10677525 New York, United States None 4,43,37,15 40.4959961,-74.2590879,40.9152556,-73.7002721 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 15.0
1293 275123 OLIPOP OLIPOP is the clinically backed consumer beverage that meets consumer’s real-world taste preferences in a delicious tonic. 37.8044 -122.271 0 massive_CB_import21_2017 any 0 0 0 0 /image/upload/yx6qdieek1mffmbjrph0 235.45512740329783696325 2374 NaN NaN 1612864255 foodtech Foodtech 5 food-and-beverage 32.0 Food and Beverage None None https://res.cloudinary.com/crunchbase-production/image/upload/yx6qdieek1mffmbjrph0 https://www.drinkolipop.com/ https://www.crunchbase.com/organization/olipop None Oakland oakland united-states United States CA 348 1 25.0 1 10677525 Oakland, United States None 4,43,37,15 37.699192,-122.3426648,37.8847249,-122.1149234 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
1294 240892 SecuredTouch Solving real-world authentication problems to support digital transformation into the “mobile era” 37.4419 -122.143 0 crunchbase crunchbase 0 0 0 0 /image/upload/v1492674022/pkuky18gpvm5m6fkef79.png 235.38222864173755510819 2376 NaN NaN 1569702672 fintech Fintech 4 fintech-other 23.0 Fintech-Other None None https://res.cloudinary.com/crunchbase-production/image/upload/v1492674022/pkuky18gpvm5m6fkef79.png http://www.securedtouch.com/ https://www.crunchbase.com/organization/securedtouch https://www.linkedin.com/company-beta/9187630/ Palo Alto palo-alto united-states United States CA 77 1 25.0 1 10677525 Palo Alto, United States None 4,43,37,15 37.2853458,-122.202476,37.4659713,-122.0867789 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
1295 48632 Womply Womply brings online tools like Google Analytics, Compete.com & Salesforce to offline merchants. \n\nWomply lets merchants:\n-visualize their revenue, social media, & online reputation performance\n-compare performance to competitors\n-identify their best customers\n-see where else customers spend\n-engage customers automatically via email/mobile to drive revenue\n\nWomply is special because it runs in the cloud: no hardware to install, no software to integrate, no training, & no Δ in payment behavior. 37.7749 -122.419 0 angellist angellist 0 0 0 0 None 235.07372273596880063451 2377 NaN NaN 1397199100 software-data Software & Data 10 data-analytics 77.0 Data Analytics None None https://www.startupblink.com/uploads/startups_logo/88c6ef3c716188d445c0f39bc40107c6.jpg https://womply.com/insights None https://www.linkedin.com/company/womply San Francisco san-francisco united-states United States CA 5 1 25.0 1 10677525 San Francisco, United States None 4,43,37,15 37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
1296 rows × 49 columns
For tqdm, see https://pypi.org/project/tqdm/
For the Requests documentation, see https://requests.readthedocs.io/en/latest/
For pandas, see https://pandas.pydata.org/pandas-docs/stable/index.html
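The long query string in the request above can also be assembled from a dictionary, which makes individual parameters easier to see and change. This is a minimal sketch using only the standard library. The parameter names and values are copied from the URL above; the helper name `build_url` is made up for illustration, and what each parameter means to the server (for example, that `countryId=1` selects the United States) is an assumption based on the output shown.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.startupblink.com/api/entities"


def build_url(page, country_id=1):
    """Return the API URL for one page (hypothetical helper)."""
    params = {
        "entity": "startups",
        "page": page,
        "bounds": "-48.58314637707078,-177.71484375,80.2661234640419,-6.152343750000001",
        "sortBy": "rank",
        "order": "desc",
        "leaderType": 1,
        "countryId": country_id,
    }
    # safe="," keeps the commas in `bounds` unescaped, as in the original URL
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


print(build_url(0))
```

With Requests, a dictionary like this can instead be passed as `s.get(BASE_URL, params=params)`. In that case the commas may be percent-encoded, which a server would normally treat the same way.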
CodePudding user response:
You can just use the API and take the fields you need.
import requests
import pandas as pd

results = []
for page in range(770):
    url = f"https://www.startupblink.com/api/entities?entity=startups&page={page}&sortBy=rank&order=desc&leaderType=1"
    response = requests.get(url)
    for business in response.json()['page']:
        results.append({
            'title': business['title'],
            'city': business['city'],
            'industry_name': business['industry_name'],
            'subindustry_name': business['subindustry_name'],
            'description': business['description']
        })

df = pd.DataFrame(results)
print(df.to_string(index=False))
OUTPUT:
title city industry_name subindustry_name description
GrabFood London Ecommerce & Retail Ecommerce GrabFood is a same-day grocery delivery company, offering delivery in as little as one hour.
DuckDuckGo Philadelphia Software & Data Software DuckDuckGo is a general search engine with:\n --No tracking.\n --Better instant answers.\n --Way less spam and clutter.\n\nMore at https://duckduckgo.com/press/
Medium San Francisco Software & Data Apps Medium is rethinking how ideas and storied are shared with the world. We believe: \n\n- Great ideas can come from anywhere\n- People create better things together\n- Design matters at a deep level\n\nWe also care deeply about how media shapes the lives of individuals and the decisions of society — and we think it can be better. \n\nWe have a world-class engineering and design team, which we are looking to grow slowly and deliberately. Let us know if you're interested.
Eventbrite San Francisco Software & Data Apps Eventbrite brings people together around the power of live events. Founded in 2006, the innovative ticketing, registration, and event discovery platform has sold more than 140M tickets in 176 countries, and processed over $2B in gross ticket sales (25% of the in the last six months). We’re transforming the ticketing and registration industry from the ground up, and we're looking for amazing people to help us change the way people get together.
...
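The API records also carry a `country` field (it appears among the columns in the output of the first answer), which covers the country name asked about in the question without parsing any HTML. The sketch below shows one way to keep only the wanted fields and to stop when a page comes back empty rather than hard-coding the page count. The names `pick_fields`, `collect`, and the injected `fetch_page` callable are made up for illustration. The assumption that an out-of-range page returns an empty `page` list has not been checked against the real API.

```python
FIELDS = ("title", "city", "country", "industry_name",
          "subindustry_name", "description")


def pick_fields(business, fields=FIELDS):
    """Keep only the wanted keys; missing keys become None."""
    return {key: business.get(key) for key in fields}


def collect(fetch_page, max_pages=770):
    """Call fetch_page(n) for n = 0, 1, ... until a page is empty.

    fetch_page is any callable returning the list stored under
    response.json()['page'] (hypothetical; injected so it can be faked).
    """
    results = []
    for page in range(max_pages):
        businesses = fetch_page(page)
        if not businesses:  # assumed signal that there are no more pages
            break
        results.extend(pick_fields(b) for b in businesses)
    return results


# Offline demo with fake pages shaped like the API records shown above
fake_pages = [
    [{"title": "DuckDuckGo", "city": "Philadelphia",
      "country": "United States", "industry_name": "Software & Data",
      "subindustry_name": "Software", "description": "Search engine",
      "lat": 40.0025}],
    [],
]
rows = collect(lambda n: fake_pages[n])
print(rows[0]["country"])  # United States
```

In real use, `fetch_page` would wrap the `requests.get(...).json()['page']` call from the loop above, and `pd.DataFrame(rows)` would build the table as before.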