I have an HTML page like this:
<div>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
</div>
I need to select a group like this:
<a href='link'>
<u class>name</u>
</a>
text
<br>
I need to extract 3 values from each group: link, name, and text. Is there any way to select a group like this and extract these values from each group in Scrapy, using CSS selectors, XPath, or anything else?
CodePudding user response:
If it's okay to wrap the text in a span like so:
<a href='link'>
<u class>name</u>
</a>
<span>text</span>
<br>
Then you can select everything in CSS like so:
a, a + span {}
Or you can style these two separately:
a {}
a + span {}
The + combinator means "comes immediately after" or "is immediately followed by".
CodePudding user response:
Scrapy provides a mechanism to yield multiple values from an HTML page as items: Python objects that define key-value pairs.
You can extract each value individually but yield them together as key-value pairs.
- To extract the value of an attribute of an element, use ::attr(), for example a::attr(href).
- To extract the text inside an element, use ::text.
For example, you can define your parse function in Scrapy like this:
def parse(self, response):
    for_link = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8) a::attr(href)').getall()
    for_name = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8) a u::text').getall()
    for_text = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8)::text').getall()
    # Yield all elements
    yield {"link": for_link, "name": for_name, "text": for_text}
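Note that the parse function above yields three parallel lists rather than one record per group. If one record per group is wanted, the lists could be paired up afterwards. The helper below is a hypothetical sketch (group_records is not part of Scrapy). It assumes every group has exactly one link, one name, and one non-blank text node, and it drops the whitespace-only text nodes that div::text also returns.

```python
def group_records(links, names, texts):
    """Pair parallel lists into one dict per group, skipping blank text nodes."""
    # div::text also returns the whitespace between tags; keep only real text.
    cleaned = [t.strip() for t in texts if t.strip()]
    return [
        {"link": link, "name": name.strip(), "text": text}
        for link, name, text in zip(links, names, cleaned)
    ]


# Made-up values shaped like what getall() might return for two groups.
records = group_records(
    ["link1", "link2"],
    ["name1", "name2"],
    ["\n", "\ntext1\n", "\n", "\ntext2\n", "\n"],
)
```

In the spider this could replace the single yield with: yield from group_records(for_link, for_name, for_text). If a group may lack its text, the lists fall out of step, and selecting relative to each a element is the safer route.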
Open the items.py file.
# Define here the models for your scraped items
# Import the required library
import scrapy

# Define the fields for the Scrapy item here, in a class
class <yourspider>Item(scrapy.Item):
    # Item key for a
    for_link = scrapy.Field()
    # Item key for u
    for_name = scrapy.Field()
    # Item key for span
    for_text = scrapy.Field()
For more details, read this tutorial.