I don't understand why the following code doesn't work when using Scrapy Selector.
In scrapy shell (to be easily replicable, but the issue remains the same in a spider):
from scrapy.selector import Selector
body = '''<html>
<body>
<li>
<p>1</p>
<p>2</p>
<p>3</p>
</li>
<li>
<p>4</p>
<p>5</p>
<p>6</p>
</li>
<li>
<p>7</p>
<p>8</p>
<p>9</p>
</li>
</body>
</html>'''
sel = Selector(text=body, type="html")
for elem in sel.xpath('//body'):
first = elem.xpath('.//li/p[1]/text()').get()
print(first)
And it prints:
1
while it should be printing:
1
4
7
Any idea on how to solve this problem ?
Thanks
CodePudding user response:
There maybe a chance that you're using the .get() method to fetch the data which you can replace with .getall(). This method will give you all the data in list format through which you can get your desired data with help of python slicing.
Or in other way there maybe a change that class name is differ in each "li" tag or you may have to use pass the in your xpath URL.
Note: Rather then fetching data with path: "elem.xpath('.//li/p[1]/text()').get()" you can simply get all data by using "elem.xpath('.//li/p/text()').getall()" and then you can put the manipulation logic over the list data which is the easiest way if you don't get your desired output.
CodePudding user response:
To grab 1 4 7
You have to hold up to p1
inside the list array, and after that You could take the iterate technique that's the second array of the element locator strategy to get the desired text nodes value .
Try:
for p in sel.xpath('//body/li/p[1]'):
first = p.xpath('.//text()').get()
Proven by scrapy shell:
In [3]: %paste
from scrapy.selector import Selector
body = '''<html>
<body>
<li>
<p>1</p>
<p>2</p>
<p>3</p>
</li>
<li>
<p>4</p>
<p>5</p>
<p>6</p>
</li>
<li>
<p>7</p>
<p>8</p>
<p>9</p>
</li>
</body>
</html>'''
## -- End pasted text --
In [4]: sel = Selector(text=body)
In [5]: for p in sel.xpath('//body/li/p[1]'):
...: first = p.xpath('.//text()').get()
...: print(first)
...:
Output:
1
4
7