IMDb top 250 movies scraper: how to get results only from year 2000 and up?

I am new to Python (and coding in general) and to Scrapy, so my knowledge of both is limited (I am basically just copying code from various Google searches).

I have managed to come up with working code so far:

import scrapy

class imdb_project(scrapy.Spider):
    name = 'imdb'
    start_urls = ['https://www.imdb.com/chart/top']

    def parse(self, response):
        for i in response.css('.titleColumn a'):
            print(i.css('::text').get())

This code works fine: I am able to scrape all 250 movie titles on https://www.imdb.com/chart/top

Now I would like to scrape a movie title from the same page only if the movie came out in the year 2000 or later.

You can see here that the year is displayed right after the movie title. This should be pretty easy to do but for the life of me I cannot find a similar example on Google or even here on Stack Overflow to get me started in solving this.

I am thinking it should be something simple like:

if movie_year >= 2000:
    then run the 'for' loop above

...but I do not know how to write this in Python. Any help would be appreciated (if possible, without regex or XPath, please).

CodePudding user response:

IMDb's conditions of use forbid scraping without permission, but here is what it would look like if you were allowed to scrape their site:

import scrapy

class imdb_project(scrapy.Spider):
    name = 'imdb'
    start_urls = ['https://www.imdb.com/chart/top']

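    def parse(self, response):
        # Each .titleColumn cell holds both the title link and the year.
        for i in response.css('.titleColumn'):
            title = i.css('a::text').get()
            # The year text looks like "(1994)". Default to '' so a missing
            # element does not raise, then drop the parentheses.
            year = i.css('.secondaryInfo::text').get(default='').strip('()')

            # Skip rows without a numeric year.
            if year.isdigit() and int(year) >= 2000:
                # Or print just the title if you don't want the year.
                print("{0} ({1})".format(title, year))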
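
The year check itself is plain Python and can be tried without Scrapy or any network access. The sketch below is only an illustration: `parse_year`, `titles_from`, and `sample_rows` are hypothetical names that are not part of the spider above, and the sample rows merely stand in for the strings the two CSS selectors would return. It sticks to string methods, so no regex or XPath is involved.

```python
def parse_year(raw):
    """Turn a string such as '(1994)' into the int 1994, or None if that fails."""
    if raw is None:
        return None
    # Remove surrounding whitespace, then the parentheses around the year.
    text = raw.strip().strip("()")
    return int(text) if text.isdecimal() else None


def titles_from(rows, min_year=2000):
    """Keep (title, year) pairs whose year is min_year or later."""
    kept = []
    for title, raw_year in rows:
        year = parse_year(raw_year)
        # Rows with a missing or non-numeric year are skipped.
        if year is not None and year >= min_year:
            kept.append((title, year))
    return kept


# Hypothetical sample rows standing in for scraped (title, year text) pairs.
sample_rows = [
    ("The Shawshank Redemption", "(1994)"),
    ("The Dark Knight", "(2008)"),
    ("Inception", "(2010)"),
    ("Row With No Year", None),
]

for title, year in titles_from(sample_rows):
    print(f"{title} ({year})")
```

The important point is that the `if` goes inside the `for` loop, so each movie is tested on its own, rather than wrapping the whole loop in a single condition.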