Below is the HTML snippet:
<P >
<b>
<a name="Editores"> Editorial </a>
"assistant"
</b>
</p>
Using this Scrapy code:
response.css("p.subtitulo *::text").extract()
I get
['Editorial', ' Assistant']
response.css("p.subtitulo *::text").get()
I get only " Assistant". I want the full string as a single value, without any commas, like:
"Editorial Assistant"
Using Beautiful Soup I get the text without a comma, but how can I do this with Scrapy? Since I have other roles separated by commas, I don't want to use split().
This is the page URL: http://www.scielo.org.co/revistas/zop/iedboard.htm
CodePudding user response:
You can do that by collecting all the text nodes with .getall() and concatenating them with ''.join(), as follows:
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['http://www.scielo.org.co/revistas/zop/iedboard.htm']

    def parse(self, response):
        for p in response.css('.subtitulo')[1:]:
            yield {
                'Name': ''.join(p.css("::text").getall())
            }
Output:
{'Name': 'Editorial Assistant'}
2022-08-08 15:39:03 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.scielo.org.co/revistas/zop/iedboard.htm>
{'Name': 'Editorial Committee '}
2022-08-08 15:39:03 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.scielo.org.co/revistas/zop/iedboard.htm>
{'Name': 'Scientific Committee'}
2022-08-08 15:39:03 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.scielo.org.co/revistas/zop/iedboard.htm>
{'Name': 'Editorial Universidad Del Norte'}
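Some of the joined values keep stray whitespace (for example the trailing space in 'Editorial Committee '). If that matters, one option is to collapse whitespace after joining. The sketch below is only an illustration: join_text_nodes is a hypothetical helper name, not part of Scrapy, and it expects the kind of list that .getall() returns.

```python
def join_text_nodes(parts):
    """Join text fragments (e.g. the list from .getall()) into one clean string."""
    # Concatenate first, so fragments that carry their own spacing stay intact,
    # then split on any run of whitespace and rejoin with single spaces.
    return " ".join("".join(parts).split())


# Sample lists modelled on the values shown in the question and the output above.
print(join_text_nodes(["Editorial", " Assistant"]))
print(join_text_nodes(["\n", "Editorial Committee ", "\n"]))
```

In the spider this would replace the plain ''.join(...) call, e.g. 'Name': join_text_nodes(p.css("::text").getall()). Note that fragments with no whitespace between them are concatenated directly, so this relies on the page's text nodes carrying their own spacing, as they do in the snippet from the question.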