How to scrape the whole information while using xpath selector

Time:06-25

I ran into a problem where I could not get all of the text while using an XPath selector. The element, as shown in the browser's developer tools, is this:

<address  data-qa-target="provider-office-address">
230 W 13th St Ste 1b<!-- 
--> <!-- 
-->New York<!-- 
-->, <!--
-->NY<!-- 
--> <!-- 
-->10011<!--
--> 
</address>

The XPath selector that I use is

response.xpath('//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address/text()').get()

The result I am getting is

230 W 13th St Ste 1b

The result I am expecting is

230 W 13th St Ste 1b New York, NY 10011

I am using scrapy for scraping. Thank you. Your help is appreciated.

Edit: The problem above is solved. I used the XPath string() function together with get() to retrieve the full text of the element node.

response.xpath('string(//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address)').get()

CodePudding user response:

Your XPath expression returns all the text nodes which are children of the address element. There are several text nodes, with comment nodes separating them!

Back in Python land, you are calling the get() method on the result, which returns only the first node of the node-set.

.get() always returns a single result; if there are several matches, content of a first match is returned; if there are no matches, None is returned. .getall() returns a list with all results. https://docs.scrapy.org/en/latest/topics/selectors.html

If you called the getall() method you would retrieve a list of strings, and you could concatenate them to produce the text you want. But a simpler method is to use the XPath function string() to get the "string value" of the address element. The XPath 1.0 specification defines the string value of an element node this way:

The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
https://www.w3.org/TR/1999/REC-xpath-19991116/#element-nodes

Applying this function to the address element returns a single string value, which you can then access using the get() method in Scrapy:

response.xpath(
   'string(//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address)'
).get()

CodePudding user response:

Removing /text() from your XPath expression changes what is selected. With /text(), the expression selects only the text nodes that are children of the address element, and get() then returns just the first of them.

Without /text(), the expression selects the address element itself:

response.xpath('//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address').get()

Note that get() on an element returns its serialized markup, including the tags and the comments, rather than plain text.
To get the whole text from all descendants, use the string() function.
