Startup Blink web scraping

Time:10-04

Hello and have a great day!

I was trying to get some information for my research on startups from the Startup Blink website (https://www.startupblink.com/startups), and here is my code:

import requests
import pandas as pd
import urllib.request
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from bs4 import BeautifulSoup
from time import sleep
from time import time

%time
df=pd.DataFrame()

for p in range(1,770):
    url=f'https://www.startupblink.com/startups?page={p}&location=united-states'
    r=requests.get(url)
    us=r.text
    soup=BeautifulSoup(us, 'html.parser')
    
    allbus=soup.find_all('div', class_='sc-2ozyz3-0 jlGOJO entity-card laptop:test')

    for bus in allbus:
        business_name=bus.find('a', class_='sc-2ozyz3-3 bPSWdR').text
        city=bus.find('div', class_='sc-2ozyz3-4 iNXPUy').find('a').text
        industry=bus.find_all('div', class_='sc-2ozyz3-4 iNXPUy')[1].find_all('a')[0].text
        industryspec=bus.find_all('div', class_='sc-2ozyz3-4 iNXPUy')[1].find_all('a')[1].text
        description=bus.find('div', class_='sc-2ozyz3-9 gHVzj').text
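        # rstrip() strips a set of characters, not a suffix; removesuffix() (Python 3.9+) drops only the trailing label
        description=description.removesuffix('\xa0Read more')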
        df = df.append({"Business_name": business_name, "City": city, "Industry": industry, 'Industry Specific': industryspec, 'Description': description}, ignore_index=True)
        sleep(0.01)
        print(p) 
df=df.dropna()
df=df.drop_duplicates()
df.describe()

Unfortunately, I could not figure out a better approach. I would like to get all the information I need directly from the page, without the inner for loop, which goes through the page several times and takes too much time.

Any suggestions?

Also, I cannot yet figure out how to get the country name from the output HTML (it is the second link in the div):

<a href="/startups/qiwi">QIWI</a>
<div ><div ></div>
<a href="/startupecosystem/moscow russia">Moscow</a>, 
<a href="/startupecosystem/russia">Russia</a></div>
<div ><div ></div>
<a href="/startups/industry/fintech">

Appreciate your help and advice!
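
For the HTML fragment quoted in the question, a minimal sketch of reading the second link in the location div with BeautifulSoup. The fragment is copied from the question (class names are absent there, so the sketch simply takes the first div); on the real page the lookup would presumably start from each card, as in the question's `bus.find('div', class_=...)` call. This assumes the location div always holds the city link first and the country link second.

```python
from bs4 import BeautifulSoup

# Fragment copied from the question; the real page carries class names on these tags.
html = """
<a href="/startups/qiwi">QIWI</a>
<div><div></div>
<a href="/startupecosystem/moscow russia">Moscow</a>,
<a href="/startupecosystem/russia">Russia</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# The first <div> is the location block; collect every link inside it.
location_div = soup.find("div")
links = location_div.find_all("a")

city = links[0].get_text(strip=True)     # first link: city
country = links[1].get_text(strip=True)  # second link: country (assumed order)

print(city, country)
```

`find('a')` returns only the first match, which is why the question's code yields the city; `find_all('a')[1]` reaches the second link.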

CodePudding user response:

That page is hydrated from an API, which is visible in the Network tab of the browser's dev tools. To get the information, you need to query that API endpoint. Here is one way to do it:

import requests
import pandas as pd
from tqdm import tqdm

s = requests.Session()
big_df = pd.DataFrame()

for x in tqdm(range(27)):
    r = s.get(f'https://www.startupblink.com/api/entities?entity=startups&page={x}&bounds=-48.58314637707078,-177.71484375,80.2661234640419,-6.152343750000001&sortBy=rank&order=desc&leaderType=1&countryId=1')
    df = pd.json_normalize(r.json()['page'])
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
print(big_df)

Result in terminal:

id  title   description lat lng unicorn import_tag  update_method   lowtech pantheon    exit    slugNumber  cb_logo url_rank    local_rank  stage   featured    when    industry_slug   industry_name   industry_id subindustry_slug    subindustry_id  subindustry_name    tags    tags_name   logo    url crunchbase  linkedin_url    city    city_slug   country_slug    country state   city_id country_id  state_id    status  highest_rank    location    claimed_by  region_ids  city_bounds country_bounds  region_name region_bounds   region_id   cluster_parent
0   4227    DuckDuckGo  DuckDuckGo is a general search engine with:\n --No tracking.\n --Better instant answers.\n --Way less spam and clutter.\n\nMore at https://duckduckgo.com/press/    40.0025 -75.118 0   angellist   angellist   0   0   0   0   None    981818.18181818176526576281 2   NaN NaN 1397184129  software-data   Software & Data 10  software    80.0    Software    365 Search  https://www.startupblink.com/uploads/startups_logo/3c3044925df3260f03ce454bf947349c.jpg https://duckduckgo.com/ None    None    Philadelphia    philadelphia    united-states   United States   PA  154 1   54.0    1   10677525    Philadelphia, United States NaN 4,43,37,15  39.8670041,-75.280303,40.1379919,-74.9557629    17.5749789,-142.2328836,55.7358113,-50.7387430  North America   7.0717019,174.8410874,72.6976623,-8.1392116 4   154.0
1   33033   Medium  Medium is rethinking how ideas and storied are shared with the world. We believe: \n\n- Great ideas can come from anywhere\n- People create better things together\n- Design matters at a deep level\n\nWe also care deeply about how media shapes the lives of individuals and the decisions of society — and we think it can be better. \n\nWe have a world-class engineering and design team, which we are looking to grow slowly and deliberately. Let us know if you're interested.    37.7749 -122.419    0   angellist   angellist   0   0   0   0   None    514218.05752427189145237207 3   NaN NaN 1397182260  software-data   Software & Data 10  apps    72.0    Apps    267 Mobile  https://www.startupblink.com/uploads/startups_logo/77cc196151a296effc9295ab70da4302.jpg http://medium.com/  None    http://www.linkedin.com/company/medium-com  San Francisco   san-francisco   united-states   United States   CA  5   1   25.0    1   10677525    San Francisco, United States    NaN 4,43,37,15  37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430  North America   7.0717019,174.8410874,72.6976623,-8.1392116 4   5.0
2   176060  Eventbrite  Eventbrite brings people together around the power of live events. Founded in 2006, the innovative ticketing, registration, and event discovery platform has sold more than 140M tickets in 176 countries, and processed over $2B in gross ticket sales (25% of the in the last six months). We’re transforming the ticketing and registration industry from the ground up, and we're looking for amazing people to help us change the way people get together. 37.7749 -122.419    0   angellist   angellist   0   0   0   0   None    454876.68161434977082535625 4   NaN NaN 1397189157  software-data   Software & Data 10  apps    72.0    Apps    267 Mobile  https://www.startupblink.com/uploads/startups_logo/1c8ed51b74f154a7eb29fdb881417fb2.jpg http://www.eventbrite.com/  None    http://www.linkedin.com/company/eventbrite  San Francisco   san-francisco   united-states   United States   CA  5   1   25.0    1   10677525    San Francisco, United States    NaN 4,43,37,15  37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430  North America   7.0717019,174.8410874,72.6976623,-8.1392116 4   5.0
3   282599  FTX Exchange    FTX Exchange is a cryptocurrency derivatives exchange company built by traders, for traders.    37.7749 -122.419    0   massive_CB_import21_2018    any 0   0   0   0   /image/upload/v3wgeajl4zaccve2fqgh  370193.95945386844687163830 5   NaN NaN 1612865162  fintech Fintech 4   cryptocurrency  20.0    Cryptocurrency  None    None    https://res.cloudinary.com/crunchbase-production/image/upload/vqz68owblsgchsqpyjzm  https://ftx.com/    https://www.crunchbase.com/organization/ftx-exchange    None    San Francisco   san-francisco   united-states   United States   CA  5   1   25.0    1   10677525    San Francisco, United States    NaN 4,43,37,15  37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430  North America   7.0717019,174.8410874,72.6976623,-8.1392116 4   5.0
4   341985  JUUL    JUUL is a manufacturer and distributor of electronic nicotine vaporizers.   37.7749 -122.419    0   massive_CB_2022 any 0   0   0   0   /image/upload/v1429671971/po5mfc1lakppkxasfvaz.png  343775.01932146179024130106 6   NaN 0.0 1642957296  social-leisure  Social & Leisure    9   social-leisure-other    68.0    Social & Leisure-Other  None    None    None    https://www.juul.com    None    None    San Francisco   san-francisco   united-states   United States   CA  5   1   25.0    1   10677525    San Francisco, United States    NaN 4,43,37,15  37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430  North America   7.0717019,174.8410874,72.6976623,-8.1392116 4   5.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1291    299976  AiCure  AiCure is an advanced data analytics company that uses artificial intelligence to understand how patients respond to treatments.    40.7128 -74.006 0   massive_CB_2022 any 0   0   0   0   None    235.74892181180308625699    2372    NaN 0.0 1642945905  software-data   Software & Data 10  data-analytics  77.0    Data Analytics  None    None    None    http://www.aicure.com   None    None    New York    new-york    united-states   United States   NY  15  1   27.0    1   10677525    New York, United States None    4,43,37,15  40.4959961,-74.2590879,40.9152556,-73.7002721   17.5749789,-142.2328836,55.7358113,-50.7387430  North America   7.0717019,174.8410874,72.6976623,-8.1392116 4   15.0
1292    262861  Primary Primary is making better clothes for kids and building a better experience for busy parents to shop for them.   40.7128 -74.006 0   massive_CB_import21_2015    any 0   0   0   0   /image/upload/v1427864328/d3eplpf1udmzamqlxbok.png  235.54787246262657163243    2373    NaN NaN 1612862505  ecommerce-retail    Ecommerce & Retail  1   ecommerce   2.0 Ecommerce   None    None    https://res.cloudinary.com/crunchbase-production/image/upload/v1427864328/d3eplpf1udmzamqlxbok.png  https://www.primary.com/    https://www.crunchbase.com/organization/primary None    New York    new-york    united-states   United States   NY  15  1   27.0    1   10677525    New York, United States None    4,43,37,15  40.4959961,-74.2590879,40.9152556,-73.7002721   17.5749789,-142.2328836,55.7358113,-50.7387430  North America   7.0717019,174.8410874,72.6976623,-8.1392116 4   15.0
1293    275123  OLIPOP  OLIPOP is the clinically backed consumer beverage that meets consumer&rsquo;s real-world taste preferences in a delicious tonic.    37.8044 -122.271    0   massive_CB_import21_2017    any 0   0   0   0   /image/upload/yx6qdieek1mffmbjrph0  235.45512740329783696325    2374    NaN NaN 1612864255  foodtech    Foodtech    5   food-and-beverage   32.0    Food and Beverage   None    None    https://res.cloudinary.com/crunchbase-production/image/upload/yx6qdieek1mffmbjrph0  https://www.drinkolipop.com/    https://www.crunchbase.com/organization/olipop  None    Oakland oakland united-states   United States   CA  348 1   25.0    1   10677525    Oakland, United States  None    4,43,37,15  37.699192,-122.3426648,37.8847249,-122.1149234  17.5749789,-142.2328836,55.7358113,-50.7387430  North America   7.0717019,174.8410874,72.6976623,-8.1392116 4   5.0
1294    240892  SecuredTouch    Solving real-world authentication problems to support digital transformation into the &ldquo;mobile era&rdquo;  37.4419 -122.143    0   crunchbase  crunchbase  0   0   0   0   /image/upload/v1492674022/pkuky18gpvm5m6fkef79.png  235.38222864173755510819    2376    NaN NaN 1569702672  fintech Fintech 4   fintech-other   23.0    Fintech-Other   None    None    https://res.cloudinary.com/crunchbase-production/image/upload/v1492674022/pkuky18gpvm5m6fkef79.png  http://www.securedtouch.com/    https://www.crunchbase.com/organization/securedtouch    https://www.linkedin.com/company-beta/9187630/  Palo Alto   palo-alto   united-states   United States   CA  77  1   25.0    1   10677525    Palo Alto, United States    None    4,43,37,15  37.2853458,-122.202476,37.4659713,-122.0867789  17.5749789,-142.2328836,55.7358113,-50.7387430  North America   7.0717019,174.8410874,72.6976623,-8.1392116 4   5.0
1295    48632   Womply  Womply brings online tools like Google Analytics, Compete.com & Salesforce to offline merchants. \n\nWomply lets merchants:\n-visualize their revenue, social media, & online reputation performance\n-compare performance to competitors\n-identify their best customers\n-see where else customers spend\n-engage customers automatically via email/mobile to drive revenue\n\nWomply is special because it runs in the cloud: no hardware to install, no software to integrate, no training, & no Δ in payment behavior. 37.7749 -122.419    0   angellist   angellist   0   0   0   0   None    235.07372273596880063451    2377    NaN NaN 1397199100  software-data   Software & Data 10  data-analytics  77.0    Data Analytics  None    None    https://www.startupblink.com/uploads/startups_logo/88c6ef3c716188d445c0f39bc40107c6.jpg https://womply.com/insights None    https://www.linkedin.com/company/womply San Francisco   san-francisco   united-states   United States   CA  5   1   25.0    1   10677525    San Francisco, United States    None    4,43,37,15  37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430  North America   7.0717019,174.8410874,72.6976623,-8.1392116 4   5.0
1296 rows × 49 columns

For TQDM visit https://pypi.org/project/tqdm/

For Requests documentation, see https://requests.readthedocs.io/en/latest/

Also for pandas: https://pandas.pydata.org/pandas-docs/stable/index.html
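
The loop above hardcodes 27 pages and rebuilds `big_df` with `pd.concat` on every iteration. A minimal sketch of an alternative that collects the raw records in a list and stops when a page comes back empty. The helper names `fetch_page` and `collect_all` are hypothetical, and the stopping rule is an assumption: the sketch presumes the endpoint returns an empty `page` list once results run out, which the answer does not confirm.

```python
from typing import Callable, Dict, List

# URL copied from the answer above; only the page number is templated.
API_URL = (
    "https://www.startupblink.com/api/entities?entity=startups&page={page}"
    "&bounds=-48.58314637707078,-177.71484375,80.2661234640419,-6.152343750000001"
    "&sortBy=rank&order=desc&leaderType=1&countryId=1"
)


def fetch_page(page: int) -> List[Dict]:
    """Fetch one page of records from the API (needs network access)."""
    import requests  # imported lazily so collect_all() can be tried without it

    response = requests.get(API_URL.format(page=page), timeout=30)
    response.raise_for_status()
    return response.json()["page"]


def collect_all(fetch: Callable[[int], List[Dict]], max_pages: int = 1000) -> List[Dict]:
    """Call fetch(0), fetch(1), ... and gather records until a page is empty."""
    rows: List[Dict] = []
    for page in range(max_pages):
        batch = fetch(page)
        if not batch:  # assumption: an empty list means there are no more pages
            break
        rows.extend(batch)
    return rows
```

With network access, `big_df = pd.json_normalize(collect_all(fetch_page))` would build the frame in a single step. `max_pages` is only a safety cap in case the assumed stopping rule does not hold.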

CodePudding user response:

You can just use the API and take the fields you need.

import requests
import pandas as pd

results = []
for page in range(770):
    url = f"https://www.startupblink.com/api/entities?entity=startups&page={page}&sortBy=rank&order=desc&leaderType=1"
    response = requests.get(url)
    for business in response.json()['page']:
        results.append({
            'title': business['title'],
            'city': business['city'],
            'industry_name': business['industry_name'],
            'subindustry_name': business['subindustry_name'],
            'description': business['description']
        })
df = pd.DataFrame(results)
print(df.to_string(index=False))

OUTPUT:

title       city           industry_name       subindustry_name  description
GrabFood    London         Ecommerce & Retail  Ecommerce         GrabFood is a same-day grocery delivery company, offering delivery in as little as one hour.
DuckDuckGo  Philadelphia   Software & Data     Software          DuckDuckGo is a general search engine with:\n  --No tracking.\n  --Better instant answers.\n  --Way less spam and clutter.\n\nMore at https://duckduckgo.com/press/
Medium      San Francisco  Software & Data     Apps              Medium is rethinking how ideas and storied are shared with the world. We believe: \n\n- Great ideas can come from anywhere\n- People create better things together\n- Design matters at a deep level\n\nWe also care deeply about how media shapes the lives of individuals and the decisions of society — and we think it can be better. \n\nWe have a world-class engineering and design team, which we are looking to grow slowly and deliberately. Let us know if you're interested.
Eventbrite  San Francisco  Software & Data     Apps              Eventbrite brings people together around the power of live events. Founded in 2006, the innovative ticketing, registration, and event discovery platform has sold more than 140M tickets in 176 countries, and processed over $2B in gross ticket sales (25% of the in the last six months). We’re transforming the ticketing and registration industry from the ground up, and we're looking for amazing people to help us change the way people get together.
...
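
The question also asked for the country name. The terminal output in the first answer lists `country` and `country_slug` columns, so the country can be taken from each record in the same way as the other fields. A minimal sketch, where `FIELDS` and `pick_fields` are hypothetical names and the sample record copies a few values from that output:

```python
# Columns to keep; 'country' is added to the fields used in the answer above.
FIELDS = ["title", "city", "country", "industry_name", "subindustry_name", "description"]


def pick_fields(business, fields=FIELDS):
    """Return only the requested keys; keys missing from the record become None."""
    return {field: business.get(field) for field in fields}


# Sample record with values taken from the first answer's output (other keys omitted).
sample = {
    "id": 4227,
    "title": "DuckDuckGo",
    "city": "Philadelphia",
    "country": "United States",
    "country_slug": "united-states",
    "industry_name": "Software & Data",
    "subindustry_name": "Software",
}

row = pick_fields(sample)
print(row["city"], "/", row["country"])
```

Inside the loop, `results.append(pick_fields(business))` would replace the hand-written dictionary. Using `.get()` avoids a `KeyError` if a record lacks one of the keys.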