Select a group of elements and text using CSS selectors

I have an HTML page like:-

<div>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
<a href='link'>
<u class>name</u>
</a>
text
<br>
</div>

I need to select a group like this:-

<a href='link'>
<u class>name</u>
</a>
text
<br>

I need to select 3 values from each group: link, name, and text. Is there any way to select a group like this and extract these particular values from each group in Scrapy, using CSS selectors, XPath, or anything else?

CodePudding user response：

If it's okay to wrap the text in a span like so:

<a href='link'>
<u class>name</u>
</a>
<span>text</span>
<br>

Then you can select everything in CSS like so:

a, a + span {}

Or you can style these two separately:

a {}
a + span {}

The + means "comes immediately after": a + span matches a span that immediately follows an a.

CodePudding user response：

Scrapy provides a mechanism to yield multiple values from an HTML page using Items: Python objects that define key-value pairs.

You can extract the values individually but yield them together as key-value pairs.

To extract the value of an attribute of an element, use ::attr(name), for example a::attr(href).
To extract an element's text rather than its markup, use ::text.

For example, you can define your parse function in Scrapy like this:

def parse(self, response):
    # href of every <a> inside the target <div>
    for_link = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8) a::attr(href)').getall()

    # Text of every <u> nested in those links
    for_name = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8) a u::text').getall()

    # Text nodes that sit directly inside the target <div>
    for_text = response.css('.row.no-gutters div:nth-child(3) div:nth-child(8)::text').getall()

    # Yield all three lists together
    yield {"link": for_link, "name": for_name, "text": for_text}

To declare these keys as a Scrapy Item, open the items.py file:

# Define here the models for your scraped items
# Import the required library
import scrapy

# Define the fields for the Scrapy item in this class
# (replace YourSpider with your spider's name)
class YourSpiderItem(scrapy.Item):

    # Item key for the href of <a>
    for_link = scrapy.Field()

    # Item key for the text of <u>
    for_name = scrapy.Field()

    # Item key for the text that follows the link
    for_text = scrapy.Field()

For more details, read this tutorial.