I'm trying to scrape a webpage that has an unknown number of <p> tags inside a known div class. Some pages have only one <p> tag, while others have 10 or even more. How can I extract them all? Preferably into one variable, so I can store them in a CSV along with all the other data I'm scraping :)
The HTML structure is as in the following example:
<div>
<h2>title text</h2>
<p> </p>
<p>text text text...</p>
<p>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p> </p>
<p><br>text text text...</p>
<p>text text text...</p>
<p>text text text...</p>
</div>
I'm using Python and the Scrapy framework to achieve this.
Currently I have:
divs = response.xpath('/html/body/div[6]/div/section[2]/article/div/div/div')
for p in divs.xpath('.//p'):  # extracts all <p> inside
    print(p.get())
    story = p
yield {
    'story': story
}
It does print the text values for all the various <p> tags, but when the data is stored to the CSV file, only the last <p> is inserted into the *.csv.
To store the scraped data into *.csv, I have the following inside my settings.py:
# Depth of crawler
DEPTH_LIMIT = 0 # 0 = infinite depth
# Feed Export Settings
FEED_FORMAT="csv"
FEED_URI="output_%(name)s.csv"
The yield part above defines the fields that go into the *.csv.
Kindest regards,
CodePudding user response:
You could do it in one line, really:
story = ' '.join([x.get().strip() for x in response.xpath('//div[6]/div/section[2]/article/div/div/div//p')])
If you can share the page URL, I can probably improve that long, fragile XPath. Nonetheless, the above should work.
Scrapy documentation can be found at https://docs.scrapy.org/en/latest/
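To illustrate the join idea outside Scrapy, here is a self-contained sketch using only the standard library's html.parser. The HTML string and class name are stand-ins for your actual page, not the real markup; the point is collecting every <p>'s text and joining it into one variable:

```python
from html.parser import HTMLParser

class PTextCollector(HTMLParser):
    """Collects the text content of every <p> element."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.parts = []       # text fragments of the current <p>
        self.paragraphs = []  # finished paragraph strings

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_p = False
            text = ''.join(self.parts).strip()
            if text:  # skip the empty <p> </p> spacer tags
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self.in_p:
            self.parts.append(data)

# Stand-in HTML mimicking the structure from the question
html = "<div><h2>title</h2><p> </p><p>first</p><p><br>second</p></div>"
collector = PTextCollector()
collector.feed(html)
story = ' '.join(collector.paragraphs)
print(story)  # first second
```

Note the empty-check in handle_endtag, which drops the blank <p> </p> spacers your pages contain, so they don't add stray whitespace to the joined story.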
CodePudding user response:
You have to store the text of all the <p> tags, join them with a space, line break, or whatever separator you want, and then assign the result to the story variable.
divs = response.xpath('/html/body/div[6]/div/section[2]/article/div/div/div')
arr = []  # this will store the text of all <p> tags
for p in divs.xpath('.//p'):  # extracts all <p> inside
    print(p.get())
    arr.append(p.get())
story = '\n'.join(arr)
yield {
    'story': story
}