Nesting parsers in Scrapy

Time:09-06

I'm using the Scrapy framework in Python to scrape data from this page. I'd like to create a single spider which will first follow the links to the six galleries, and then scrape some data from each of these pages and also follow the link in each to Read the Curators' Statement; from that page, I'd like to scrape the text of the statement. How should the parsers be nested in order to accomplish this task?

import scrapy


class GalleriesSpider(scrapy.Spider):
    name = "galleries"
    start_urls = ['https://www.exploratorium.edu/visit/galleries']

    def parse(self, response):
        galleries_page_links = response.xpath('//h2[text()="Museum Galleries"]/following-sibling::div//h5/a/@href')
        yield from response.follow_all(galleries_page_links, self.parse_gallery)

    def parse_gallery(self, response):
        def extract(query):
            return response.xpath(query).get(default='').replace(u'\xa0', u' ').strip()

        def extracts(query):
            return [item.replace(u'\xa0', u' ').strip() for item in response.xpath(query).getall()]

        # def parse_curator(response):
        #     def extracts_merge(query):
        #         return ' '.join(extracts(query))
        # 
        #     yield {
        #         'curator-statement': extracts_merge('//div[@id="main-content"]'
        #                                                       '//div[@]//p//text()')
        #     }

        # this_curator_url = extracts('//div[@id="main-content"]//p/a/@href')[-1]
        # this_curator_statement = response.follow(this_curator_url, parse_curator(this_curator_url))

        yield {
            'url': response.url,
            'title': extract('//div[@id="main-content"]//h1/text()'),
            'subtitle': extract('//div[@id="main-content"]//h3/text()'),
            'description': extract('//div[@id="main-content"]//h3/following-sibling::p/text()'),
            'highlights_url': extracts('//div[@]//h5/a/@href'),
            'curator-url': extract('//div[@id="main-content"]//p/a/@href'),
        }\
            #.update(this_curator_statement)

The code above produces a spider that scrapes data (as expected) from the gallery pages. However, when I add the commented-out code, I get AttributeError: 'str' object has no attribute 'xpath'. I think it's because this_curator_url is not a Scrapy response object. What's the best way to nest parsers in this situation?

CodePudding user response:

You don't need to nest the parsers at all. Instead, create a separate parse callback method for each kind of page, and call them in sequence as you extract the necessary data from each page. You can pass the information gathered so far to the next parser through the cb_kwargs argument of the Scrapy request, so that you can complete the item and yield the final result all at once.
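As for the AttributeError in the question, one plausible way to reach it is calling a parser directly with a URL string, when Scrapy expects the callback function itself and calls it later with the downloaded response. Here is a minimal sketch of that difference that runs without Scrapy. Note that FakeResponse and schedule are hypothetical stand-ins for a response object and for the engine's scheduling; they are not Scrapy APIs.

```python
class FakeResponse:
    """Hypothetical stand-in for a Scrapy response (not a Scrapy class)."""

    def __init__(self, url, paragraphs):
        self.url = url
        self.paragraphs = paragraphs

    def xpath(self, query):
        # A real response evaluates the XPath query; this stub ignores it
        # and returns canned text nodes.
        return self.paragraphs


def parse_curator(response):
    # Needs a response-like object, because it calls .xpath() on it.
    return {'curator-statement': ' '.join(response.xpath('//p//text()'))}


def schedule(url, callback):
    """Hypothetical stand-in for the engine: 'download' the URL, then call the callback."""
    response = FakeResponse(url, ['First paragraph.', 'Second paragraph.'])
    return callback(response)


url = 'https://example.com/curator-statement'

# Wrong: this calls the parser right away with a plain string.
try:
    parse_curator(url)
    error = None
except AttributeError as exc:
    error = str(exc)

# Right: pass the function itself; it is called later with a response.
item = schedule(url, parse_curator)

print(error)
print(item)
```

The same rule applies to response.follow and scrapy.Request: pass self.parse_curator, not self.parse_curator(...).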

For example, the spider could be restructured like this:

import scrapy

def extract(query, response):
    return response.xpath(query).get(default='').replace(u'\xa0', u' ').strip()

def extracts(query, response):
    return [item.replace(u'\xa0', u' ').strip() for item in response.xpath(query).getall()]

def extracts_merge(query, response):
    return ' '.join(extracts(query, response))

class GalleriesSpider(scrapy.Spider):
    name = "galleries"
    start_urls = ['https://www.exploratorium.edu/visit/galleries']

    def parse(self, response):
        galleries_page_links = response.xpath('//h2[text()="Museum Galleries"]/following-sibling::div//h5/a/@href')
        yield from response.follow_all(galleries_page_links, self.parse_gallery)

    def parse_gallery(self, response):
        kwargs = {
            'url': response.url,
            'title': extract('//div[@id="main-content"]//h1/text()', response),
            'subtitle': extract('//div[@id="main-content"]//h3/text()', response),
            'description': extract('//div[@id="main-content"]//h3/following-sibling::p/text()', response),
            'highlights_url': extracts('//div[@]//h5/a/@href', response),
            'curator-url': extract('//div[@id="main-content"]//p/a/@href', response),
        }
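        # The scraped href is relative, so resolve it against the current page URL.
        url = response.urljoin(kwargs['curator-url'])
        # cb_kwargs hands the partially built item to the next callback as keyword arguments.
        yield scrapy.Request(url, self.parse_curator, cb_kwargs=kwargs)

    def parse_curator(self, response, **kwargs):
        # Complete the item with the statement text and yield it once.
        kwargs['curator-statement'] = extracts_merge('//div[@id="main-content"]//div[@]//p//text()', response)
        yield kwargs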
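A note on how cb_kwargs reaches the callback: Scrapy passes the dict as keyword arguments after the response. That is why parse_curator is declared with **kwargs here: a key such as 'curator-url' is not a valid Python identifier, so it can only be received through a catch-all. The plain-Python sketch below illustrates this and one possible alternative, which is wrapping the partial item under a single key. The name parse_curator_item and the placeholder string used as the response are assumptions for illustration only.

```python
def parse_curator(response, **kwargs):
    # Catch-all form, as in the spider: any string key is accepted,
    # including 'curator-url', which could never be a parameter name.
    kwargs['curator-statement'] = response
    return kwargs


def parse_curator_item(response, item):
    # Alternative (hypothetical name): one named parameter carries the
    # whole partial item, so the keys inside it are unrestricted.
    return {**item, 'curator-statement': response}


partial = {'url': 'https://example.com/gallery', 'curator-url': '/statement'}
response = 'statement text'  # placeholder string standing in for a response

# Keyword unpacking mirrors how cb_kwargs is delivered to a callback.
via_kwargs = parse_curator(response, **partial)
via_item = parse_curator_item(response, **{'item': partial})

print(via_kwargs)
print(via_item)
```

With the second form, the spider would send cb_kwargs={'item': kwargs} and declare parse_curator(self, response, item). Both forms produce the same final item, so the choice is mostly a matter of taste.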

OUTPUT (from running the spider)

2022-09-02 18:07:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.exploratorium.edu/visit/outdoor-gallery/curator-statement> (referer: https://www.exploratorium.edu/visit/gallery-5)
2022-09-02 18:07:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.exploratorium.edu/visit/west-gallery/curator-statement>
{'url': 'https://www.exploratorium.edu/visit/gallery-1', 'title': 'Bernard and Barbro Osher Gallery 1: Human Phenomena', 'subtitle': 'Experiment with thoughts, feelings, and social behavior.', 'description': 'Humans think, feel, and interact, and these phenomena are all open to scientific investigation and creative exploration. Here, you and others are the exhibits—so play with social interactions, observe others, and contribute yourreflections.', 'highlights_url': ['/arts/black-box', '/visit/calendar/stories-of-change', '/exhibits/recollections', '/exhibits/pi-has-your-number', '/exhibits/catenary-arch', '/exhibits/survival-game'], 'curator-url': '/visit/west-gallery/curator-statement', 'curator-statement': "The experiences in the Osher Gallery focus on cognition, emotion, social behavior, and the interplay between science, society, art, and culture. We all perceive the world, remember the past, look forward to the future, and communicate with each other—and both scientists and artists investigate how and why we do so. In this gallery, you can explore how your mind works and learn about the scientific study of human behavior through exhibits on emotion, language, memory, and pattern recognition. The space is also home to Science of Sharing , a project funded by the National Science Foundation to develop exhibits that let you experiment with cooperation, competition, and strategies for sharing resources. Here, you're the exhibit; the mechanisms presented here are just tools through which you can play with and reflect on your experiences. The gallery is also a venue for dynamic temporary exhibitions; the first was The Changing Face of What Is Normal , a collection of artifacts and experiences exploring the evolving natureof normality and the lives of those affected by mental illness. In addition, the Black Box offers a state-of-the-art immersive environment for large, media-based exhibitions by visiting artists. 
The gallery also features works by past and present Exploratorium artists-in-residence.  Pamela Winfrey , Curator  Hugh McDonald , Associate Curator"}
2022-09-02 18:07:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.exploratorium.edu/visit/bay-observatory-gallery/curator-statement>
{'url': 'https://www.exploratorium.edu/visit/gallery-6', 'title': 'Fisher Bay Observatory Gallery 6: Observing Landscapes', 'subtitle': 'Uncover the history, geography, and ecology of the Bay Area.', 'description': 'Natural and human forces interact to create the dynamic landscape surrounding us. Learn to uncover the stories embedded in a place by directly observing the geography, history, and ecology of the San Francisco Bay region.', 'highlights_url': ['/environmental-field-station', '/visit/calendar/conversations-about-landscape', '/exhibits/library-of-earth-anatomy', '/exhibits/bay-lexicon', '/exhibits/visualizing-the-bay-area', '/exhibits/timepieces'], 'curator-url': '/visit/bay-observatory-gallery/curator-statement', 'curator-statement': 'This second-floor, indoor/outdoor exhibition space features spectacular views of the Bay and San Francisco’s northern waterfront, as well as its urban, downtown cityscape. The Fisher Bay Observatory Gallery  and Terrace use these views as an entry point for investigations of the history and dynamic processes in the local landscape, and the human impact. The exhibits, artworks, and instruments here probe the environment from multiple perspectives, such as physical and geographic sciences, ecology, astronomy, history, and contemporary experience. A smallbrowsing library of maps and books from the past and present helps visitors explore ideas that shape the Bay Area. The Fisher Bay Observatory Gallery also introduces visitors to the process of observation, and the tools and methods scientists use to gather information about the world around us. Some instruments like cameras and telescopes help us observe the landscape directly, while other exhibits present live or archived data or, visualizations, and eventually video streams, creating a picture of our surroundings that we otherwise might never see. Susan Schwartzenberg , Curator'}
2022-09-02 18:07:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.exploratorium.edu/visit/east-gallery/curator-statement>
{'url': 'https://www.exploratorium.edu/visit/gallery-4', 'title': 'Gordon and Betty Moore Gallery 4: Living Systems', 'subtitle': 'Explore life from DNA and cells to organisms and ecosystems.', 'description': 'Sometimes life is hard to observe, because it’s too tiny or fast or is hidden underground or in the ocean. Discover what you’ve been missing: use scientific tools to investigate living things of different sizes, the ecosystems they inhabit, and the processes they share.', 'highlights_url': ['/cellstoself', '/exhibits/living-systems-explainer-station', '/exhibits/plankton-populations', '/exhibits/tidal-memory', '/exhibits/live-chicken-embryos',  'http://www.exploratorium.edu/imaging_station/'], 'curator-url': '/visit/east-gallery/curator-statement', 'curator-statement': 'Gallery 4 fosters an appreciation of the living world and the many ways to explore it. Using authentic scientific methods and tools, visitors learn about living things at different scales, the processes they share, and their ecosystems. Anchored by the Life Sciences Laboratory, a working laboratory that cultivates organisms for exhibits, the gallery is a dynamic space where contributions by the scientific and artistic communities come together to provide unique and engaging experiences. A Cells and Development section includes the renovated Microscope Imaging Station, which gives visitors a direct look through research-grade microscopes at stem cell biology and other aspects of development. Living Liquid is a new exhibit area that focuses on tiny drifting marine organisms called plankton as well as often-overlooked species in the Bay and the ocean beyond. Life around Us blends new and classic exhibits that examine familiar organisms and reveal their amazing behaviors and unusual features. Located at the eastern end of Pier 15, with a magnificent view of the Bay, the gallery invites an exploratory yet contemplative interaction with the biological world. 
Kristina Yu ,Curator  Jennifer Frazier , Associate Curator'}
2022-09-02 18:07:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.exploratorium.edu/visit/central-gallery/curator-statement>
{'url': 'https://www.exploratorium.edu/visit/gallery-3', 'title': 'Bechtel Gallery 3: Seeing & Reflections', 'subtitle': 'Experiment with light, mirrors, and bubbles.', 'description': 'Our eyes respond to light, but this is just one aspect of how we perceive the world. Playing with light is a great way to learn how it works. And investigating real phenomena can give you a deeper understanding of the scientific process.', 'highlights_url': ['/exhibits/cubatron-core', '/exhibits/giant-mirror', '/exhibits/soap-film-painting', '/exhibits/colored-shadows', '/exhibits/monochromatic-room', '/exhibits/out-quiet-yourself'], 'curator-url': '/visit/central-gallery/curator-statement', 'curator-statement': 'Bechtel Gallery 3 is the heart of the Exploratorium, a place designed to spark and nurture visitors’ curiosity and challenge them to investigate natural phenomena for themselves—with tools and gentle guidance to catalyze their explorations. The gallery features many of our favorite classic exhibits, but it also introduces new exhibits and experimental prototypes that reflect our efforts to share current science as it advances. The primary activity in the gallery is experimentation in the broadest sense. Visitors are encouraged to discover things for themselves through exhibits designed as experiments, with opportunities for experimental variations and controls. Most importantly, visitors have a unique opportunity to learn-by-doing about the scientific process itself, the power of experiment to answer questions, and the roles of knowledge and creativity in discovering connections among diverse phenomena. Immersive and evocative experiences will inspire further explorations. Thomas Humphrey, Curator Richard O. Brown, Associate Curator'}
2022-09-02 18:07:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.exploratorium.edu/visit/south-gallery/curator-statement>
{'url': 'https://www.exploratorium.edu/visit/gallery-2', 'title': 'Gallery 2: Tinkering', 'subtitle': 'Explore your creativity and our curious contraptions.', 'description': 'Making things and developing ideas by hand helps us construct understanding. Slow down, settle in, and make something personally meaningful—from playful contraptions to surprising connections between mechanical systems and natural phenomena.', 'highlights_url': ['/exhibits/tinkerers-clock', 'https://www.exploratorium.edu/tinkering/blog', '/video/art-tinkering-scott-weavers-100000-toothpick-sculpture-san-francisco', '/exhibits/your-turn-counts', '/exhibits/lariat-chain', 'https://www.exploratorium.edu/tinkering/projects/cardboard-automata', 'https://www.exploratorium.edu/tinkering/projects/chain-reaction', 'https://www.exploratorium.edu/tinkering/projects/circuit-boards'], 'curator-url': '/visit/south-gallery/curator-statement', 'curator-statement': 'A tall, fanciful, interactive Tinkerer’s Clock towers over Gallery 2, welcoming you to a public workshop area where you can make, build, or tinker, either alone or with others, as a way of exploring the world and your own creativity. Here, familiar materials are used in unfamiliar ways, and exhibits highlight the beauty—and, sometimes, whimsy—of scientific complexity and discovery. The Tinkering Studio is the heart of this gallery. In this immersive space, visitors use tools and materials to explore the intersection of science, art, and technology. We try experiments for the first time, or play along with other makers and artists. Whether expert of novice, we’re all learning together by making something that is personally meaningful. Adjacent to the gallery is the museum’s exhibit-building workshop, whee most of our exhibits are made. Open to public view, you’ll see our staff working with a variety of materials—woodworking tools, drills, and lathes, for example—and some of our exhibits in various stages of development. The Learning Studio is also in the gallery space. It serves as a research-and-development lab for staff and artists/collaborators. Here, we try things out, make mistakes, get excited, become delighted, and every now and then stumble on to something great that we share with visitors in the Tinkering Studio and in professional development workshops for teachers and museum educators.  Mike Petrich and Karen Wilkinson, Curators '}
2022-09-02 18:07:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.exploratorium.edu/visit/outdoor-gallery/curator-statement>
{'url': 'https://www.exploratorium.edu/visit/gallery-5', 'title': 'Gallery 5: Outdoor Exhibits', 'subtitle': 'Explore winds, tides, and natural phenomena.', 'description': 'Investigate forces shaping the City, Bay, and region. Watch shifting winds and tides, reveal hidden life, shake a bridge, observe human behavior, and find new ways to notice the places we inhabit.', 'highlights_url': ['/exhibits/aeolian-harp', '/exhibits/color-of-water', '/exhibits/bike-rope-squirter', '/exhibit/wind-arrows', '/exhibits/sun-swarm', '/exhibits/research-buoy', '/exhibits/golden-gate-bridge', '/exhibits/disappearing-rings', '/visit/outdoor-gallery/remote-rains'], 'curator-url': '/visit/outdoor-gallery/curator-statement', 'curator-statement': 'The guiding principle of the Gallery 5 is to support and expand the Exploratorium’s role as a community museum dedicated to awareness. Helping to reinvent the civic role of a public museum as a place to gather and exchange ideas, the gallery also exemplifies how direct observations of natural and urban phenomena can blossom into artistic endeavors, scientific investigations, and open-ended inquiries. The gallery features a combination of large- and small-scale exhibits, rotating art installations, and public programs  (including vendors, performance artists, and public exhibitions). Our defining location on the urban edge of the city and the Bay enhances visitors’ ability to perceive their surroundings with heightened precision and clarity that leads to deepened insight and understanding. The gallery team is also extending its efforts beyond the boundaries of the Exploratorium campus, developing community-based partnerships that stretch throughout San Francisco and the Bay Area to create interactive outposts that both engage and delight. Shawn Lani , Curator Eric Dimond, Associate Curator'}
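
The output also shows why the spider resolves the link before requesting it: the scraped 'curator-url' values are site-relative paths, while the crawled URLs are absolute. As a rough illustration, the standard library's urllib.parse.urljoin does the same kind of resolution that response.urljoin builds on. The sketch also covers an edge case that is an assumption on my part, since none of the six pages hit it: if a gallery page had no matching link, extract would return '', and an empty href resolves back to the gallery page itself. The helper curator_target is hypothetical.

```python
from urllib.parse import urljoin

gallery_url = 'https://www.exploratorium.edu/visit/gallery-1'
curator_href = '/visit/west-gallery/curator-statement'

# A root-relative href replaces the path of the base URL.
curator_url = urljoin(gallery_url, curator_href)

# An empty href resolves to the base URL unchanged.
fallback_url = urljoin(gallery_url, '')


def curator_target(base_url, href):
    """Hypothetical helper: absolute URL, or None when no link was found."""
    return urljoin(base_url, href) if href else None


print(curator_url)
print(fallback_url)
print(curator_target(gallery_url, ''))
```

In the spider, a guard like this could yield the partial item directly when no statement link exists. Otherwise the request would point back at the already-crawled gallery page, and Scrapy's default duplicate filter would typically discard it, silently losing the item.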