I wanted to scrape the feed of sitepoint.com. This is my code:
import scrapy
from urllib.parse import urljoin


class SitepointSpider(scrapy.Spider):
    # TODO: Add url tags (like /javascript) to the spider based on class parameters
    name = "sitepoint"
    allowed_domains = ["sitepoint.com"]
    start_urls = ["http://sitepoint.com/javascript/"]

    def parse(self, response):
        data = []
        for article in response.css("article"):
            title = article.css("a.t12xxw3g::text").get()
            href = article.css("a.t12xxw3g::attr(href)").get()
            img = article.css("img.f13hvvvv::attr(src)").get()
            time = article.css("time::text").get()
            url = urljoin("https://sitepoint.com", href)
            text = scrapy.Request(url, callback=self.parse_article)
            data.append(
                {"title": title, "href": href, "img": img, "time": time, "text": text}
            )
        yield data

    def parse_article(self, response):
        text = response.xpath(
            '//*[@id="main-content"]/article/div/div/div[1]/section/text()'
        ).extract()
        yield text
And this is the output I get:
[{'title': 'How to Build an MVP with React and Firebase',
  'href': '/react-firebase-build-mvp/',
  'img': 'https://uploads.sitepoint.com/wp-content/uploads/2021/09/1632802723react-firebase-mvp-app.jpg',
  'time': 'September 28, 2021',
  'text': <GET https://sitepoint.com/react-firebase-build-mvp/>}]
It does not scrape the article URLs. I followed everything said in this question but still could not make it work.
CodePudding user response:
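You have to visit the detail page from the listing to scrape the article.
A Request is only scheduled once the spider yields it; stored inside a dict it stays an unexecuted request object, which is why 'text' shows <GET ...>.
So yield the Request for each article URL first, carrying the listing fields along with it, and yield the data only in the last callback (parse_article).
Also, the XPath //*[@id="main-content"]/article/div/div/div[1]/section/text()
won't return the article text: text() selects only text nodes that are direct children of the section
tag, while the text sits inside the many HTML elements nested under it.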
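To see why the /text() step comes back empty, here is a minimal sketch on a made-up fragment. It uses the standard library's ElementTree instead of Scrapy's selectors (an assumption, so that it runs without Scrapy installed): elem.text stands in for the direct text of the section, and itertext() for the text of all descendants, which is what //text() selects.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the article page: the section holds
# only child elements, and the text lives further down inside <p> tags.
fragment = (
    "<section>"
    "<div><p>First <a>linked</a> paragraph.</p></div>"
    "<div><p>Second.</p></div>"
    "</section>"
)
section = ET.fromstring(fragment)

# Text directly inside <section>, before its first child: there is none.
direct_text = section.text

# Text of every descendant node, in document order.
all_text = "".join(section.itertext())

print(repr(direct_text))
print(repr(all_text))
```

The direct text is empty while the descendant text contains the paragraphs, which matches the behaviour described for section/text().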
One solution is to scrape all the HTML inside the section
tag and clean it afterwards to get the article text.
Here is the full working code:
import re
import scrapy
from urllib.parse import urljoin


class SitepointSpider(scrapy.Spider):
    # TODO: Add url tags (like /javascript) to the spider based on class parameters
    name = "sitepoint"
    allowed_domains = ["sitepoint.com"]
    start_urls = ["http://sitepoint.com/javascript/"]

    def clean_text(self, raw_html):
        """
        :param raw_html: this will take raw html code
        :return: text without html tags
        """
        # Matches HTML tags and character entities such as &amp; or &#8217;
        cleaner = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
        return re.sub(cleaner, '', raw_html)

    def parse(self, response):
        for article in response.css("article"):
            title = article.css("a.t12xxw3g::text").get()
            href = article.css("a.t12xxw3g::attr(href)").get()
            img = article.css("img.f13hvvvv::attr(src)").get()
            time = article.css("time::text").get()
            url = urljoin("https://sitepoint.com", href)
            # Follow the article link and carry the listing fields along in meta
            yield scrapy.Request(
                url,
                callback=self.parse_article,
                meta={"title": title, "href": href, "img": img, "time": time},
            )

    def parse_article(self, response):
        title = response.request.meta["title"]
        href = response.request.meta["href"]
        img = response.request.meta["img"]
        time = response.request.meta["time"]
        all_data = {}
        article_html = response.xpath('//*[@id="main-content"]/article/div/div/div[1]/section').get()
        all_data["title"] = title
        all_data["href"] = href
        all_data["img"] = img
        all_data["time"] = time
        # .get() returns None when the XPath matches nothing, so fall back to ""
        all_data["text"] = self.clean_text(article_html or "")
        yield all_data
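The clean_text regex above deletes entities such as &amp; instead of decoding them, and removing tags without any separator can glue neighbouring words together. A minimal sketch of a variant follows, using only the standard library. The name html_to_text is hypothetical, and a regex is only a rough approximation of HTML parsing (it will misfire on a ">" inside an attribute value, for instance).

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # anything that looks like a tag
WS_RE = re.compile(r"\s+")        # runs of whitespace


def html_to_text(raw_html):
    """Strip tags, decode character entities, and collapse whitespace."""
    # Replace each tag with a space so adjacent blocks do not run together.
    without_tags = TAG_RE.sub(" ", raw_html)
    # Turn entities such as &amp; or &#8217; into the characters they encode.
    decoded = html.unescape(without_tags)
    return WS_RE.sub(" ", decoded).strip()


sample = "<p>Low-code &amp; <a href='#'>Back4App</a></p><p>Next&#8217;s</p>"
print(html_to_text(sample))
```

The trade-off is that a space now also appears where an inline tag sits right before punctuation, so this is a starting point rather than a finished cleaner.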
CodePudding user response:
You can pull the clean section text directly with an XPath expression:
import scrapy
from urllib.parse import urljoin
from scrapy.crawler import CrawlerProcess


class SitepointSpider(scrapy.Spider):
    name = "sitepoint"
    allowed_domains = ["sitepoint.com"]
    start_urls = ["http://sitepoint.com/javascript/"]

    def parse(self, response):
        for article in response.css("article"):
            title = article.css("a.t12xxw3g::text").get()
            href = article.css("a.t12xxw3g::attr(href)").get()
            img = article.css("img.f13hvvvv::attr(src)").get()
            time = article.css("time::text").get()
            url = urljoin("https://sitepoint.com", href)
            # Follow the article link and carry the listing fields along in meta
            yield scrapy.Request(
                url,
                callback=self.parse_article,
                meta={"title": title, "href": href, "img": img, "time": time},
            )

    def parse_article(self, response):
        title = response.request.meta["title"]
        href = response.request.meta["href"]
        img = response.request.meta["img"]
        time = response.request.meta["time"]
        all_data = {}
        # //text() selects every text node under the paragraphs, not just direct children.
        # Note: stripping each node and joining with '' drops the spaces around inline links.
        article_html = ''.join([x.strip() for x in response.xpath('//*[@id="main-content"]/article/div/div/div[1]/section//div//p//text()').getall()])
        all_data["title"] = title
        all_data["href"] = href
        all_data["img"] = img
        all_data["time"] = time
        all_data["text"] = article_html
        yield all_data


if __name__ == "__main__":
    # Run the spider as a plain script instead of via "scrapy crawl"
    process = CrawlerProcess()
    process.crawl(SitepointSpider)
    process.start()
Output:
{'title': 'A Beginner’s Guide to the Parse Platform on Back4App', 'href': '/parse-platform-back4app-beginner-guide/', 'img': 'https://uploads.sitepoint.com/wp-content/uploads/2021/11/1636417748parse-back4app.jpg', 'time': 'November 09, 2021', 'text': 'These days, it seems like the future of software developers is bleak with the rise of no-code platforms. Fortunately, there’s a way to make ourselves more efficient today by leveraging our existing
skills to build new apps using low-code platforms. Unlike no-code, low-code platforms are
more flexible and offer greater customizable features. You can write custom code snippets
and install Node.js packages to give your app more advanced features.In this article, I’ll present a high-level overview ofBack4App, aBackend-as-a-Service(BaaS)platform that hosts
Parse applications for developers. BaaS platforms allow developers to quickly develop and
launch new back-end apps with minimum effort. They also eliminate the need to set up hosting and configuring autoscaling, which can be a time-consuming task for developers.TheParse platformis a popular, open-source framework for building application back ends. It runs on Node.js and is written to work with Express.js. Simply put, it’s like an open-source version of Firebase that you can run on your machine and host on your own server.The origins
of the project date back to 2011, whenParse Incwas founded to provide a back-end tool for
mobile developers. The startup raised $5.5 million in venture capital funding, which allowed it to grow its user base to 20,000 developers within a year.The company became so successful that it was acquired two years later by Facebook for $85 million. By 2014, the platform was hosting about 500,000 mobile apps. Unfortunately, Facebookfailed to investin the development of the platform and decided to shut down the service by January 2017. In order
to assist its customers, Facebook open-sourced the Parse platform so as to allow developers to migrate their apps to their own self-hosted Parse server.Since then, the open-source
community has continually worked on the project and has built a website, online documentation and community forum. Today, Parse provides a number of back-end features that include:The Parse platform is mainly made up of:Note that there are several Parse projects that I
haven’t mentioned here. For example, there are Android and IOS apps that provide front-end interfaces for Parse server.Parse server currently supports Mongo and PostgreSQL databases, which are the leading databases in the NoSQL and SQL spaces respectively. Both databases are quite capable, which makes it difficult to choose which one to go with.Thisdetailed
guidemay be of assistance. In my opinion, if you’re a beginner, MongoDB is a better choice, as it’s more flexible and has a shallower learning curve. If you’re an experienced SQL developer, you’d be more productive with PostgreSQL. Below is a quick comparison for each database.Pros:Cons:Previous issues like ACID compliance and JOINS are now officially supported in the latest versions of MongoDB.Pros:Cons:If you’re still confused about which one to use, fortunately Back4App has an answer for you.Back4App is a cackend-as-a-service company that hosts Parse server apps for developers at an affordable rate. It greatly simplifies the development of Parse apps. All you need to do is tosign upfor a free tier account (no credit card) to get started with 250MB of data storage and 25k requests.Paid plans offer larger resource quotas and more features such as backups, data recovery, CDN, auto scaling and high request performance. The free plan only is only recommended for learning, while the paid plans are capable of handling thousands of requests per second. See thefull pricing pagefor more details.Back4App allows you to create and manage multiple Parse apps on the same dashboard. This is a huge time saver compared to manually installing, configuring
and hosting each parse server yourself. The difference is minutes vs hours.Back4App uses Mongo for the database. However, it behaves as if it’s running PostgreSQL. This is great, since you get the advantages of SQL databases while using a non-SQL one — such as referential integrity, foreign key constraints and schema validation. This implementation is done in code and runs between the database and the dashboard.The database browser organizes tables (collections) as classes and data is laid out in a spreadsheet format. You can add/edit/delete/reorder columns, specify data types, and import/export data in CSV or JSON formats.The spreadsheet interface allows you to create and edit rows of data easily. You can also upload binary files such as images or PDFs into columns that have the File data type. This is another huge time saver, as you don’t need to configure a file storage service to handle binary data. With Parse, it’s already built-in and configurable to support external file storage services.Parse provides a built-in email/password authentication service. Users and roles are stored in the database and can be viewed and created via the database browser. Users can also be created programmatically via SDK, REST or GraphQL API endpoints.Here’s an example of a sign-up function implemented on the front end using the Parse JavaScript SDK:Back4App allows developers to enableemail verificationandpassword recoveryfeatures for their Parse apps. These are essential account management features that users expect when using any secure application.In addition to the default authentication method, you can enable your Parse app to authenticate using any of the following sign in methods:Authorization determines if an authenticated user has access to information stored on the database.
Permissions are defined with the use ofRolesandAccess Controls. There are two levels of access controls:Parse usesaccess control lists (ACL)to protect private data from being publicly accessible. However, if the user has some data that needs to be shared publicly, a second ACL needs to be created in order to grant public access. Do note that class-level permissions will always override ACL permissions.This is a new feature that allows storing data in a private Ethereum blockchain network. Blockchain differs from a traditional database in that, once records are inserted and verified, they can’t be updated or deleted. This has many practical implementations where trust between parties is critical in a business transaction.At the time of writing, thisfeatureis still in the alpha stage.Often when building user interfaces, you’ll need to populate certain input elements with data such as list
of countries, cities, zip codes, vehicle models, colors, and so on. Back4App solves this problem by providing theDatabase Hub, a list of public databases that you can freely access and use for your app.A dataset example of all the cities of the world is pictured below:There are three ways of accessing a public database:The last two methods allow you to modify the public datasets as you like.When building real-time applications, you may be forced
to fetch new data every one or so seconds in order to check if there’s been any new update. This technique is known aspolling, and it’s problematic, because it causes high network
and server usage. Imagine if your app is being used by tens of thousands of users.Parse has a built-in protocol known asLiveQuerythat allows clients to subscribe/unsubscribe to a LiveQuery server. When the relevant data is updated, the LiveQuery server pushes the new data to all clients that have subscribed to it.With Back4App,activatingthe LiveQuery server
is as simple as going to your App’sServer settings>Server URL and Live Queryand activating it.With front-end–heavy applications, a lot of data manipulation is done on the client device. Often this requires sending huge amounts of data so that the front-end code can process and use it to display a summary of the information. End users are likely to experience sluggishness using your app.Parse provides a built-in feature known asCloud Code Functionsthat allows all the heavy data lifting to be performed on the server itself. For example, if you want the average sale value of a specific product in the last year, you can simply retrieve all the necessary data within the server environment, perform the calculation and send the value to the front-end client.Performing such actions on the server is quicker, more efficient, and will result in a smoother experience for the end users. Another benefit of Parse’s Cloud Function is that it runs in a full Node.js environment, unlike AWS Lambda and Cloudflare Workers. This means you can install any Node.js package you want without having to resort to workarounds.Here are examples ofCloud Code Functionsthat run on your Parse Server app:Here’s how you can call Cloud functions from your frontend app:You can also implement advanced features with Cloud Code Functions, such assending SMS text messagesto any phone using theTwilioAPI:Other advanced examples of cloud functions you can implement in your Parse Server app include accepting credit card payments via theStripeAPI and sending emails via theSendGridAPI.Triggersare cloud functions that allow you to implement custom logic such as formatting or validation before and after an event. Take a look at the validation code example below:In this example above, the validation code ensures that users can’t give less than a one- or more than five-star rating in a review. Otherwise, the client will receive an error. 
Parse currently supports the following types of triggers:With Cloud Code, you can ensure the same behavior for all the client apps that you support — such as web, Android, iOS, and so on.Cloud jobsare simply long-running functions where you
don’t expect a response. Examples include batch processing a large set of images, or web scraping. You can also use cloud jobs to perform tasks such removing inactive users that haven’t verified their emails.Do note Parse server doesn’t provide scheduling. Fortunately,
Back4App does — through a feature known as theCron Job. You simply write a cloud function
in this format:Next, you upload the cron job code to your app, and then you use theBackground jobsfeature to schedule when your code should run.You can further extend the capabilities for your Parse server app by installingNode.js packagesandParse Adapters. The image below shows some of the adapters maintained by the core Parse community.Adapters are simply
Node.js packages that can be installed by uploading apackage.jsonfile to your Cloud Functions dashboard. An example of an adapter is theparse-server-sqs-mq-adapterwhich enables integration with of a Parse Server app with Amazon Simple Queue Service.Unfortunately, many of the community-contributed adapters and modules have been deprecated or aren’t being actively maintained. So you’ll probably need to use an officially supported npm package and write custom code in order to ensure your code is secure by using the latest dependencies.If you use anyconsole.logorconsole.errorfunctions in your Cloud Code, they’ll be displayed in theCloud Code>Logsdashboard, as pictured below.Logs can be viewed in the following categories:Event logging is an important aspect of running production apps, as it can help you
understand requests and discover bugs in your code.Back4App providesAnalyticsreporting tools — which is a bonus feature, since the open-source Parse Server only supports capturing
of data but not reporting. Back4App’s Analytics reporting tool helps in providing real-time information about your app such as growth, conversion, performance andusage behavior.The tool comes with a set of pre-defined tracking reports which include:The image below shows an example of a Performance report.You can also define your owncustom events report, which will allow you to track any event via the Parse SDK. See the following example code implemented on the client side via Parse SDK:The above code captures data and sends it to the
Parse server. This data can later be queried and used to build a custom events report.Parse supports every major front-end framework and language through itsSDK libraries, including these:Unsupported programming languages can use theRESTandGraphQLAPIs to interact with data on a Parse Server. To use theParse JavaScript SDKin a browser environment, you’ll need to install the followingnpm library:Then import it like so:The library directly interacts with the Parse Server by providing developers with a set of functions that they can execute. These functions can handle operations such as:Below are examples of CRUD operations using the Parse SDK in JavaScript:The majority of low-code and no-code platforms allow you to build specific solutions very quickly with no coding experience. Unfortunately, these platforms often lock you in and have limited capabilities. Parse and Back4App fortunately provides experienced developers with all the customization they need and the freedom to host with any cloud provider.Some of additional features Back4App provides that haven’t been mentioned include:To conclude, I’ll leave you with this question. How would you prefer building your next back-end application?'}
... and so on
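The glued words in the output (for example "ofBack4App" and "features.In this article") come from stripping every text node and joining the pieces with an empty string. One possible way around it, sketched under the assumption that the text nodes are collected per paragraph: join the raw nodes of each paragraph unstripped, collapse whitespace, and separate paragraphs with a newline. The helper name join_paragraphs and the sample data are hypothetical.

```python
import re


def join_paragraphs(paragraphs):
    """Build readable text from a list of paragraphs.

    Each paragraph is a list of raw (unstripped) text nodes, so the spaces
    that surround inline links are kept until whitespace is collapsed.
    """
    cleaned = []
    for fragments in paragraphs:
        text = re.sub(r"\s+", " ", "".join(fragments)).strip()
        if text:  # skip paragraphs that contain only whitespace
            cleaned.append(text)
    return "\n".join(cleaned)


# Made-up text nodes, shaped like what //text() yields for two <p> elements.
sample = [
    ["I'll present a high-level overview of ", "Back4App", ", a ",
     "Backend-as-a-Service", " platform."],
    ["The ", "Parse platform", " is a popular framework."],
]
print(join_paragraphs(sample))

# In the spider, the input could be built per <p> element, for example:
#   paragraphs = [p.xpath(".//text()").getall() for p in response.xpath(P_XPATH)]
# where P_XPATH is the paragraph XPath from the answer without the trailing //text().
```

This keeps "of Back4App" intact and puts each paragraph on its own line, at the cost of one extra XPath call per paragraph.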