Hello everybody! I have been using BeautifulSoup for my scraping projects, and I'm now learning Scrapy. I wrote BeautifulSoup code that loops over multiple pages of a single website with a for loop: it goes over 10 pages and collects the URLs of the blog posts on those pages, using the code below. I want to do the same thing in Scrapy but can't figure out how. Can the same approach (code) be used with Scrapy? Here is the BeautifulSoup code:
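import re
import requests
from bs4 import BeautifulSoup

URL = 'https://www.brookings.edu/topic/environment/page/'
lis = []
for page in range(1, 11):  # pages 1 through 10
    req = requests.get(URL + str(page) + '/?type=posts')
    soup = BeautifulSoup(req.text, 'lxml')
    # keep only anchors whose href points at a blog post
    links = [link['href'] for link in soup.find_all('a',
             href=re.compile('^(https://www.brookings.edu/blog/)'))]
    links = list(set(links))  # drop duplicates within the page
    lis.append(links)  # one list of links per page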
This code fetched the links from 10 pages of the website. I stored these blog post links in the list named lis, which is defined outside the for loop. Then, with another for loop over that list, I wrote code to extract the text from the blog posts.
CodePudding user response:
import scrapy


class BrSpider(scrapy.Spider):
    name = 'br'
    allowed_domains = ['brookings.edu']

    def start_requests(self):
        # request listing pages 1 through 10
        for page in range(1, 11):
            yield scrapy.Request(f'https://www.brookings.edu/topic/environment/page/{page}/?type=posts', callback=self.parse)

    def parse(self, response):
        # each post title on the listing page wraps a link to the post
        for i in response.css('.title a::attr(href)'):
            yield {
                'Link': i.get()
            }
Output:
[
{"Link": "https://www.brookings.edu/blog/up-front/2018/12/10/3-big-societal-problems-to-fix-in-2019/"},
{"Link": "https://www.brookings.edu/blog/brookings-now/2018/10/31/highlights-an-energy-industry-view-on-moving-toward-a-lower-carbon-future/"},
{"Link": "https://www.brookings.edu/blog/brown-center-chalkboard/2018/10/30/climate-confusion-content-and-strategies-not-controversy-are-the-biggest-challenges-for-science-teachers/"},
{"Link": "https://www.brookings.edu/blog/fixgov/2018/10/25/the-economics-and-politics-of-carbon-pricing/"},
{"Link": "https://www.brookings.edu/blog/planetpolicy/2018/10/11/climate-reality-requires-starting-at-home-weaning-from-fossil-fuels/"},
{"Link": "https://www.brookings.edu/blog/techtank/2018/10/05/sharing-digitized-dna-sequences-must-balance-scientific-progress-with-fair-use/"},
{"Link": "https://www.brookings.edu/blog/africa-in-focus/2018/07/21/africa-in-the-news-eac-trade-statistics-nelson-mandela-day-and-conservation-updates/"},
{"Link": "https://www.brookings.edu/blog/future-development/2018/07/06/the-sustainable-development-goals-and-climate-finance-catalytic-agent-or-empty-vessel/"},
{"Link": "https://www.brookings.edu/blog/africa-in-focus/2018/06/14/enhancing-the-attractiveness-of-private-investment-in-hydropower-in-africa/"},
{"Link": "https://www.brookings.edu/blog/planetpolicy/2018/06/01/trump-tried-to-kill-the-paris-agreement-but-the-effect-has-been-the-opposite/"},
{"Link": "https://www.brookings.edu/blog/up-front/2021/02/23/transforming-natural-resource-governance-break-silos-sharpen-politics/"},
{"Link": "https://www.brookings.edu/blog/future-development/2021/02/09/its-critical-that-we-invest-in-better-global-weather-and-climate-observations/"},
{"Link": "https://www.brookings.edu/blog/future-development/2021/02/04/secular-stagnation-climate-action-and-the-natural-rate-of-interest/"},
{"Link": "https://www.brookings.edu/blog/africa-in-focus/2021/01/27/figures-of-the-week-carbon-taxes-can-fuel-green-economic-recovery-and-reduce-income-inequality/"},
{"Link": "https://www.brookings.edu/blog/future-development/2021/01/25/to-support-climate-action-growth-measures-should-count-planetary-damages/"},
{"Link": "https://www.brookings.edu/blog/order-from-chaos/2021/01/25/the-national-security-imperative-to-tackle-illegal-unreported-and-unregulated-fishing/"},
{"Link": "https://www.brookings.edu/blog/up-front/2021/01/15/time-to-pivot-the-role-of-the-energy-transition-and-investors-in-forging-resilient-resource-rich-country-outcomes/"},
{"Link": "https://www.brookings.edu/blog/education-plus-development/2019/12/10/national-climate-strategies-are-forgetting-about-girls-children-and-youth/"},
{"Link": "https://www.brookings.edu/blog/africa-in-focus/2021/05/15/africa-in-the-news-wildlife-horn-of-africa-and-infrastructure-updates/"},
{"Link": "https://www.brookings.edu/blog/planetpolicy/2021/05/10/barriers-to-achieving-us-climate-goals-are-more-political-than-technical/"},
{"Link": "https://www.brookings.edu/blog/up-front/2020/12/15/the-trump-administrations-major-environmental-deregulations/"},
{"Link": "https://www.brookings.edu/blog/up-front/2021/01/14/disrupting-the-waste-management-industry-through-technology-insights-from-rubicon/"},
{"Link": "https://www.brookings.edu/blog/fixgov/2020/12/28/who-is-and-isnt-represented-in-environmental-oversight-in-congress/"},
{"Link": "https://www.brookings.edu/blog/up-front/2020/12/17/regulating-autonomous-vehicles-and-ridesharing-lessons-from-california/"},
{"Link": "https://www.brookings.edu/blog/planetpolicy/2019/12/09/building-an-ambitious-us-climate-policy-from-the-bottom-up/"},
{"Link": "https://www.brookings.edu/blog/future-development/2019/12/02/top-emitters-must-commit-to-a-u-turn-at-cop25/"},
{"Link": "https://www.brookings.edu/blog/brown-center-chalkboard/2019/11/20/how-exposure-to-pollution-affects-educational-outcomes-and-inequality/"},
{"Link": "https://www.brookings.edu/blog/africa-in-focus/2019/10/03/a-conversation-with-guinean-president-conde-on-natural-resource-management-in-africa/"},
{"Link": "https://www.brookings.edu/blog/the-avenue/2019/09/24/how-a-scrappy-federal-it-program-can-be-a-model-for-us-climate-action/"},
{"Link": "https://www.brookings.edu/blog/planetpolicy/2019/09/17/success-from-the-un-climate-summit-will-hinge-on-new-ways-to-build-national-action/"},
{"Link": "https://www.brookings.edu/blog/future-development/2019/09/16/the-invisible-water-crisis/"},
{"Link": "https://www.brookings.edu/blog/planetpolicy/2021/05/10/republicans-in-congress-are-out-of-step-with-the-american-public-on-climate/"},
{"Link": "https://www.brookings.edu/blog/africa-in-focus/2021/05/05/figures-of-the-week-africas-renewable-energy-potential/"},
{"Link": "https://www.brookings.edu/blog/order-from-chaos/2021/04/26/will-cannabis-legalization-reduce-crime-in-mexico-has-it-in-the-us/"},
{"Link": "https://www.brookings.edu/blog/africa-in-focus/2021/04/10/africa-in-the-news-updates-on-natural-resources-and-politics-in-niger-djibouti-benin-and-chad/"},
{"Link": "https://www.brookings.edu/blog/africa-in-focus/2021/03/26/africas-green-bond-market-trails-behind-other-regions/"},
{"Link": "https://www.brookings.edu/blog/up-front/2021/03/04/understanding-and-mitigating-climate-change-risks/"},
{"Link": "https://www.brookings.edu/blog/up-front/2020/12/09/business-as-usual-is-not-an-option-the-future-of-natural-resource-governance/"},
{"Link": "https://www.brookings.edu/blog/future-development/2020/11/25/delhi-the-worlds-most-air-polluted-capital-fights-back/"},
{"Link": "https://www.brookings.edu/blog/planetpolicy/2020/11/23/around-the-halls-what-should-the-biden-administration-prioritize-in-its-climate-policy/"},
{"Link": "https://www.brookings.edu/blog/future-development/2020/11/19/to-ride-covid-19s-green-wave-governments-must-slash-fossil-fuel-subsidies/"},
{"Link": "https://www.brookings.edu/blog/africa-in-focus/2020/11/12/figure-of-the-week-africas-used-vehicle-market-and-the-environment/"},
{"Link": "https://www.brookings.edu/blog/future-development/2019/08/23/for-growth-and-well-being-climate-crisis-overshadows-all-else/"},
{"Link": "https://www.brookings.edu/blog/fixgov/2020/06/28/oil-gas-and-mining-corruption-is-it-inevitable/"},
{"Link": "https://www.brookings.edu/blog/order-from-chaos/2020/09/28/global-warming-fires-and-crime-in-mexico-and-beyond/"},
{"Link": "https://www.brookings.edu/blog/planetpolicy/2019/09/16/the-fight-to-contain-climate-change-implementing-paris-mobilizing-action/"},
{"Link": "https://www.brookings.edu/blog/planetpolicy/2019/09/13/campaign-2020-what-candidates-are-saying-on-climate-change/"},
{"Link": "https://www.brookings.edu/blog/planetpolicy/2018/06/01/one-year-since-trumps-withdrawal-from-the-paris-climate-agreement/"},
{"Link": "https://www.brookings.edu/blog/africa-in-focus/2021/03/01/recipe-for-a-green-recovery-carbon-taxes/"},
{"Link": "https://www.brookings.edu/blog/up-front/2021/02/25/seizing-opportunities-for-fuel-subsidy-reform/"},
{"Link": "https://www.brookings.edu/blog/brown-center-chalkboard/2020/10/28/the-importance-of-clean-air-in-classrooms-during-the-pandemic-and-beyond/"},
{"Link": "https://www.brookings.edu/blog/order-from-chaos/2020/10/26/not-dried-up-us-mexico-water-cooperation/"},
{"Link": "https://www.brookings.edu/blog/order-from-chaos/2020/10/12/saving-the-vaquita-marina-and-the-urgency-of-this-fall/"},
{"Link": "https://www.brookings.edu/blog/up-front/2020/10/06/using-extractive-industries-data-for-better-governance/"},
{"Link": "https://www.brookings.edu/blog/future-development/2019/07/10/to-save-forests-think-beyond-the-trees/"},
{"Link": "https://www.brookings.edu/blog/fixgov/2019/07/08/the-politics-of-methane/"},
{"Link": "https://www.brookings.edu/blog/africa-in-focus/2019/06/08/africa-in-the-news-new-environmental-policies-on-the-continent-zimbabwes-imf-stabilization-program-and-sudan-update/"},
{"Link": "https://www.brookings.edu/blog/up-front/2019/05/17/india-2024-a-green-india/"},
{"Link": "https://www.brookings.edu/blog/future-development/2019/04/25/the-critical-frontier-reducing-emissions-from-chinas-belt-and-road/"},
{"Link": "https://www.brookings.edu/blog/future-development/2019/04/24/new-data-on-governance-of-national-oil-companies-why-transparency-and-oversight-matter/"},
{"Link": "https://www.brookings.edu/blog/education-plus-development/2019/03/28/why-captain-planet-should-have-been-a-woman/"},
{"Link": "https://www.brookings.edu/blog/techtank/2021/11/16/how-technology-can-help-with-methane-regulation/"},
{"Link": "https://www.brookings.edu/blog/order-from-chaos/2020/06/16/reopening-the-world-to-prevent-zoogenic-pandemics-regulate-wildlife-trade-and-food-production/"},
{"Link": "https://www.brookings.edu/blog/order-from-chaos/2020/06/08/play-the-game-a-presidents-climate-quandary/"},
{"Link": "https://www.brookings.edu/blog/order-from-chaos/2020/05/11/wildlife-trade-in-mexico-conservation-and-pandemics/"},
{"Link": "https://www.brookings.edu/blog/up-front/2020/04/30/six-covid-related-deregulations-to-watch/"},
{"Link": "https://www.brookings.edu/blog/planetpolicy/2020/04/23/covid-19-and-climate-your-questions-our-answers/"},
{"Link": "https://www.brookings.edu/blog/future-development/2020/04/22/global-solutions-to-global-bads-2-practical-proposals-to-help-developing-countries-deal-with-the-covid-19-pandemic/"},
{"Link": "https://www.brookings.edu/blog/order-from-chaos/2020/09/14/illegal-fishing-in-mexico-and-policy-responses/"},
{"Link": "https://www.brookings.edu/blog/africa-in-focus/2020/09/05/africa-in-the-news-mali-coup-mauritius-oil-spill-and-covid-19-updates/"},
{"Link": "https://www.brookings.edu/blog/planetpolicy/2020/08/19/amid-covid-19-dont-ignore-the-links-between-poor-air-quality-and-public-health/"},
{"Link": "https://www.brookings.edu/blog/up-front/2020/08/07/uncommon-ground-the-impact-of-natural-resource-corruption-on-indigenous-peoples/"},
{"Link": "https://www.brookings.edu/blog/africa-in-focus/2020/08/05/the-controversy-over-the-grand-ethiopian-renaissance-dam/"},
{"Link": "https://www.brookings.edu/blog/techtank/2018/05/31/catastrophic-risk-to-ecosystems-puts-biotechnology-fixes-on-the-table/"},
{"Link": "https://www.brookings.edu/blog/planetpolicy/2018/05/29/transition-to-electric-vehicles-in-karnataka-and-india-whats-real-possible-and-missing-in-the-ecosystem/"},
{"Link": "https://www.brookings.edu/blog/fixgov/2018/05/16/young-republicans-diverge-on-climate-policy/"},
{"Link": "https://www.brookings.edu/blog/africa-in-focus/2018/05/10/figures-of-the-week-access-to-affordable-sustainable-and-modern-energy-in-africa/"},
{"Link": "https://www.brookings.edu/blog/social-mobility-memos/2018/04/21/earth-day-it-is-about-equity-as-well-as-the-environment/"},
{"Link": "https://www.brookings.edu/blog/brookings-now/2018/04/20/on-earth-day-5-facts-about-environmental-policy-and-research/"},
{"Link": "https://www.brookings.edu/blog/future-development/2019/01/28/the-deforestation-risks-of-chinas-belt-and-road-initiative/"},
{"Link": "https://www.brookings.edu/blog/fixgov/2018/12/20/what-frances-yellow-vest-protests-reveal-about-the-future-of-climate-action/"},
{"Link": "https://www.brookings.edu/blog/planetpolicy/2021/11/10/infrastructure-in-the-developing-world-is-a-planetary-furnace-heres-how-to-cool-it/"},
{"Link": "https://www.brookings.edu/blog/planetpolicy/2021/10/25/net-zero-carbon-pledges-have-good-intentions-but-they-are-not-enough/"},
{"Link": "https://www.brookings.edu/blog/future-development/2021/09/28/the-risks-of-us-eu-divergence-on-corporate-sustainability-disclosure/"},
{"Link": "https://www.brookings.edu/blog/order-from-chaos/2021/07/26/a-porpoise-to-serve-rescuing-the-vaquita-and-the-us-mexico-relationship/"},
{"Link": "https://www.brookings.edu/blog/africa-in-focus/2021/07/23/addressing-africas-extreme-water-insecurity/"},
{"Link": "https://www.brookings.edu/blog/future-development/2021/07/07/transnational-governance-of-natural-resources-for-the-21st-century/"},
{"Link": "https://www.brookings.edu/blog/up-front/2020/04/21/how-to-reduce-emissions-as-much-as-possible-at-the-lowest-cost/"},
{"Link": "https://www.brookings.edu/blog/the-avenue/2020/04/14/weakening-environmental-reviews-for-transportation-infrastructure-is-a-bridge-too-far/"},
{"Link": "https://www.brookings.edu/blog/africa-in-focus/2020/02/18/why-ethiopia-egypt-and-sudan-should-ditch-a-rushed-washington-brokered-nile-treaty/"},
{"Link": "https://www.brookings.edu/blog/up-front/2020/07/30/the-evolution-of-the-eiti-and-next-steps-for-tackling-extractive-industries-corruption/"},
{"Link": "https://www.brookings.edu/blog/up-front/2020/07/23/a-master-class-in-corruption-the-luanda-leaks-across-the-natural-resource-value-chain/"},
{"Link": "https://www.brookings.edu/blog/order-from-chaos/2020/07/20/the-damage-trumps-wall-causes-in-mexico/"},
{"Link": "https://www.brookings.edu/blog/future-development/2020/07/10/what-the-pandemic-reveals-about-governance-state-capture-and-natural-resources/"},
{"Link": "https://www.brookings.edu/blog/up-front/2017/12/20/estimating-the-rising-cost-of-a-surprising-tax-shelter-the-syndicated-conservation-easement/"},
{"Link": "https://www.brookings.edu/blog/future-development/2021/07/02/protecting-forests-are-early-warning-systems-effective/"},
{"Link": "https://www.brookings.edu/blog/fixgov/2021/06/24/when-climate-policy-works-hfcs-and-the-case-of-short-lived-climate-pollutants/"},
{"Link": "https://www.brookings.edu/blog/brown-center-chalkboard/2021/05/19/now-is-the-time-to-invest-in-school-infrastructure/"},
{"Link": "https://www.brookings.edu/blog/planetpolicy/2017/12/08/fill-the-gaps-in-the-tax-bill-with-a-carbon-tax-and-expanded-benefits-for-working-families/"},
{"Link": "https://www.brookings.edu/blog/order-from-chaos/2017/11/27/on-the-vices-and-virtues-of-trophy-hunting/"}
]
CodePudding user response:
Create main_dict. It will hold all the data as key-value pairs of each URL and the text associated with it, grouped by page number.
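import requests
from bs4 import BeautifulSoup

main_dict = {}
for page in range(1, 11):  # pages 1 through 10, as in the question
    dict_data = {}
    print(page)
    res = requests.get(f"https://www.brookings.edu/topic/environment/page/{page}?type=posts")
    soup = BeautifulSoup(res.text, "html.parser")
    main_data = soup.find_all("h4", class_="title")
    lst = []
    for i in main_data:
        # map each post URL to its title text
        url = i.find("a")['href']
        text = i.find("a").get_text(strip=True)
        dict_data[url] = text
    # store the page's URL-to-title dict once, after the inner loop
    lst.append(dict_data)
    main_dict[page] = lst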
Output:
{1: [{'https://www.brookings.edu/blog/africa-in-focus/2021/12/06/focac-2021-chinas-retrenchment-from-africa/': 'FOCAC 2021: China’s retrenchment from Africa?',
'https://www.brookings.edu/blog/future-development/2021/12/06/back-to-the-future-climate-change-resilience-self-insurance-and-market-insurance/': 'Back to the future: Climate change resilience, self-insurance, and market insurance',
'https://www.brookings.edu/blog/order-from-chaos/2021/12/06/what-biden-should-say-to-putin-on-ukraine/': 'What Biden should say to Putin on Ukraine',
...
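To feed this result into the second loop described in the question, the nested structure (page number, then a list, then a dict of URL to title) probably needs flattening first. Here is a small sketch of that step. The helper name `flatten_links` and the `example.com` sample data are made up for illustration and are not real scrape output.

```python
def flatten_links(main_dict):
    """Return unique post URLs in page order from {page: [{url: title}, ...]}."""
    seen = {}
    for page in sorted(main_dict):
        for entry in main_dict[page]:
            for url in entry:
                # dicts preserve insertion order, so the first sighting wins
                seen.setdefault(url, None)
    return list(seen)


# Hand-made sample in the same shape as main_dict (hypothetical URLs).
sample = {
    2: [{"https://example.com/blog/c/": "C", "https://example.com/blog/a/": "A"}],
    1: [{"https://example.com/blog/a/": "A", "https://example.com/blog/b/": "B"}],
}
urls = flatten_links(sample)
print(urls)
```

Unlike `list(set(...))`, this keeps the URLs in the order the pages list them, which makes the second loop's output easier to compare against the site.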