Why a certain section of a website is not scraped using python with either scrapy or bs4


I am trying to scrape the following site: https://oxolabs.eu/#portfolio

The info I am looking to scrape is the company URLs from the portfolio section. I tried with Scrapy first, but it returns this (the website is crawled but not scraped):

2022-07-28 11:46:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://oxolabs.eu/?status=funded#portfolio> (referer: None)
2022-07-28 11:46:03 [scrapy.core.engine] INFO: Closing spider (finished)

BeautifulSoup returned every URL except the ones in the portfolio section.

Can someone explain why that section is not being scraped, and how I could scrape it?

My BeautifulSoup script:

from bs4 import BeautifulSoup
import requests

url = "https://oxolabs.eu/?status=funded#portfolio"
ua={'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
r = requests.get(url, headers=ua, verify=False)
soup = BeautifulSoup(r.text, features="lxml")

for link in soup.find_all('a'):
    print(link.get('href'))

I have also attached the script I used with Scrapy:

import scrapy


class StupsbSpider(scrapy.Spider):
    name = 'stupsb'
    allowed_domains = ['oxolabs.eu']
    start_urls = ['https://oxolabs.eu/?status=funded#portfolio']

    def parse(self, response):
        startups = response.xpath("//section[@class='oxo-section oxo-portfolio']")
        for startup in startups:
            # name = startup.xpath(".//a[@class='portfolio-entry-media-link']/@title").getall(),
            # industry = startup.xpath(".//div[@class='text-block-6']//text()").get(),
            url = startup.xpath("//section[@class='oxo-section oxo-portfolio']//@href").getall()
            yield {
                'url' : url,
            }

CodePudding user response:

The data you require is loaded dynamically from an API using JavaScript, so you are trying to obtain links that are not yet in the DOM when requests fetches the page. If you want to scrape the rendered page, I would look into using Selenium as a headless browser.
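
For example, here is a minimal headless sketch (this assumes Selenium 4.x with Chrome available; the CSS selector is a guess based on the section class in your question, not something verified against the live page):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--headless=new')  # run without opening a browser window
driver = webdriver.Chrome(options=options)

driver.get('https://oxolabs.eu/?status=funded#portfolio')
driver.implicitly_wait(10)  # give the JavaScript time to populate the DOM

# 'section.oxo-portfolio' is taken from the question's XPath and is an assumption
for link in driver.find_elements(By.CSS_SELECTOR, 'section.oxo-portfolio a'):
    print(link.get_attribute('href'))

driver.quit()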

If it were me, though, I wouldn't scrape the page at all. Why not just use requests on this link:

https://api.oxoservices.eu/api/v1/startups?site=labs&startup_status=funded

You can then tweak the startup_status query-string parameter to be either funded, accelerating, or exited. The data you're looking for comes back already structured, with no restrictions, and you can pull the image or any other fields you require from the JSON payload.

As an example to get you started:

import json
import requests

# Hit the same API the site itself calls for the portfolio data
resp = requests.get('https://api.oxoservices.eu/api/v1/startups?site=labs&startup_status=funded')

# Parse the JSON body (equivalently: json_resp = resp.json())
json_resp = json.loads(resp.text)

# Each entry under 'data' is one startup record
for company in json_resp['data']:
    print(json.dumps(company, indent=4))
    print()

This will give you a list of startups, with each company looking like this:

{
    "id": 1047,
    "name": "Betme",
    "photo": {
        "id": "d800cf0b-7772-4f9a-a7fc-3563976aa292",
        "filename": "6f85f02d55c7db098a2cd141bf2b4c60.png",
        "mime": "image/png",
        "type": "photo",
        "size": 47951,
        "url": "/attachments/d800cf0b-7772-4f9a-a7fc-3563976aa292",
        "created_at": "2021-04-01T03:46:01.000000Z"
    },
    "photo_id": "d800cf0b-7772-4f9a-a7fc-3563976aa292",
    "cover": null,
    "cover_id": null,
    "focus_id": 25,
    "focus": {
        "id": 25,
        "name": "E-Sport/E-Gaming",
        "color": "rgb(138, 102, 73)",
        "is_active": true,
        "created_at": "2019-09-23T16:50:43.000000Z",
        "updated_at": null
    },
    "startup_stage_id": 1,
    "website": "https://www.betmegaming.com",
    "video_id": null,
    "summary": "A Betme egy applik\u00e1ci\u00f3 form\u00e1j\u00e1ban \u00faj\u00edtja meg az e-gaming vil\u00e1g\u00e1t. K\u00f6z\u00f6ss\u00e9gi megold\u00e1sainak k\u00f6sz\u00f6nhet\u0151en a j\u00e1t\u00e9kosok p\u00e9nzkereseti lehet\u0151s\u00e9ghez jutnak.",
    "video_type_id": "1",
    "startup_status": {
        "id": 5,
        "key": "funded",
        "name": "Funded"
    },
    "startup_investment_type": {
        "id": 3,
        "key": "seed",
        "name": "Seed"
    },
    "startup_valuation_basis": null,
    "raised_type": {
        "id": 1,
        "key": "none",
        "name": "Not seeking"
    },
    "is_active": false,
    "irr": 0,
    "created_at": "2020-07-23T16:40:13.000000Z"
}

Using data like this is usually a far more efficient and simpler way to get what you want, as it's already in a structured format.
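
For instance, here is a minimal sketch pulling just the name and website for each status (the field names come from the sample payload above; I'm assuming the API accepts all three status values as described):

import requests

API = 'https://api.oxoservices.eu/api/v1/startups'

for status in ('funded', 'accelerating', 'exited'):
    resp = requests.get(API, params={'site': 'labs', 'startup_status': status})
    for company in resp.json()['data']:
        # 'website' may be null for some records, so guard the access
        print(status, company['name'], company.get('website') or '-')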

CodePudding user response:

Only the images are loaded by JavaScript; the rest of the desired data is static.

Example:

import scrapy


class StupsbSpider(scrapy.Spider):
    name = 'stupsb'
    start_urls = ['https://oxolabs.eu/?status=funded#portfolio']

    def parse(self, response):
        # The class value in this selector was lost when the answer was
        # archived; 'portfolio-entry-media' is a guess, so substitute the
        # actual class from the page markup.
        startups = response.xpath("//*[@class='portfolio-entry-media']/a/@href")
        for startup in startups:
            yield {
                'url': startup.get()
            }
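
If the spider lives in a standalone file, you can run it without a full Scrapy project via scrapy runspider stupsb.py -o urls.json (the filename here is just an assumption).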

Output:

{'url': 'https://www.linkedin.com/pub/peter-oszkó/25/705/3b3'}
2022-07-28 16:37:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://oxolabs.eu/?status=funded>
{'url': 'https://www.linkedin.com/in/rita-jánoska/'}
2022-07-28 16:37:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://oxolabs.eu/?status=funded>
{'url': 'https://www.linkedin.com/in/gergely-balogh-1bbb3573/'}
2022-07-28 16:37:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://oxolabs.eu/?status=funded>
{'url': 'https://www.linkedin.com/in/orsolya-csetri-940b5721/'}
2022-07-28 16:37:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://oxolabs.eu/?status=funded>
{'url': ''}
2022-07-28 16:37:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://oxolabs.eu/?status=funded>
{'url': 'https://www.linkedin.com/in/marai-mónika-klaudia-048973193/'}
2022-07-28 16:37:02 [scrapy.core.engine] INFO: Closing spider (finished)
2022-07-28 16:37:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
 'downloader/response_status_count/200': 1,
 'item_scraped_count': 6,
 