How to scrape a various amount of <p> in between a <div> class

Time:10-17

I'm trying to scrape a webpage that has an unknown number of <p> tags inside a known div class. Some pages have only one <p> tag, while others have 10 or even more. How can I extract them all? Preferably into one variable, so I can store it in a CSV alongside all the other data I'm scraping :)

The HTML structure is as in the following example:

<div >
    <h2 >title text</h2>
    <p>&nbsp;</p>
    <p>text text text...</p>
    <p>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>&nbsp;</p>
    <p><br>text text text...</p>
    <p>text text text...</p>
    <p>text text text...</p>
</div>

I'm using python and scrapy framework to achieve this.

Currently I have:

divs = response.xpath('/html/body/div[6]/div/section[2]/article/div/div/div')
for p in divs.xpath('.//p'):  # extracts all <p> inside
    print(p.get())
story = p

yield {
    'story': story
}

It does print all the text values of the various <p> tags, but when the data is stored, only the last <p> ends up in the *.csv file.
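That behaviour is plain Python loop-variable scoping rather than anything Scrapy-specific; a minimal sketch (no Scrapy involved) illustrates why the assignment after the loop captures only the last item:

```python
# Plain-Python illustration of why only the last <p> survives: after a
# for loop finishes, the loop variable keeps whatever it was bound to on
# the final iteration, so assigning it *after* the loop stores only the
# last item instead of all of them.
paragraphs = ["<p>first</p>", "<p>second</p>", "<p>last</p>"]

for p in paragraphs:
    print(p)       # prints every paragraph

story = p          # p now holds only the last paragraph
```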

To store the scraped data into *.csv, I have the following inside my settings.py:

# Depth of crawler
DEPTH_LIMIT = 0  # 0 = infinite depth

# Feed Export Settings
FEED_FORMAT = "csv"
FEED_URI = "output_%(name)s.csv"
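As a side note, FEED_FORMAT and FEED_URI still work in older Scrapy releases but are deprecated in newer ones (2.1+), which use a single FEEDS setting instead; an equivalent sketch of that form:

```python
# settings.py sketch using the newer FEEDS setting (Scrapy 2.1+);
# it replaces the deprecated FEED_FORMAT / FEED_URI pair.
FEEDS = {
    "output_%(name)s.csv": {"format": "csv"},
}
```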

The yield shown above defines the fields that go into the *.csv.

Kindest regards,

CodePudding user response:

You could do it in one line, really:

story = ' '.join([x.get().strip() for x in response.xpath('//div[6]/div/section[2]/article/div/div/div//p')])

If you could confirm the page URL, I could probably improve that long, fragile XPath. Nonetheless, the above should work.

Scrapy documentation can be found at https://docs.scrapy.org/en/latest/

CodePudding user response:

You have to collect the text of all the <p> tags, join them with a space, line break, or whatever separator you want, and then assign the result to the story variable.

divs = response.xpath('/html/body/div[6]/div/section[2]/article/div/div/div')
arr = []  # this will store the text of all <p> tags
for p in divs.xpath('.//p'):  # extracts all <p> inside
    print(p.get())
    arr.append(p.get())
story = '\n'.join(arr)

yield {
    'story': story
}