I am new to Python (and coding in general) and to Scrapy, so my knowledge of both is limited (I am mostly copying code from various Google searches).
So far I have managed to come up with working code:
import scrapy

class imdb_project(scrapy.Spider):
    name = 'imdb'
    start_urls = ['https://www.imdb.com/chart/top']

    def parse(self, response):
        for i in response.css('.titleColumn a'):
            print(i.css('::text').get())
This code works fine: I am able to scrape all 250 movie titles from https://www.imdb.com/chart/top
Now I would like to scrape titles from the same page only if the movie came out in 2000 or later.
On that page, the year is displayed right after the movie title. This should be pretty easy to do, but for the life of me I cannot find a similar example on Google, or even here on Stack Overflow, to get me started.
I am thinking it should be something simple like:
if movie_year >= 2000:
    then run the 'for' loop above
...but I do not know how to write this in Python. Any help would be appreciated (if possible, no regex or XPath please).
Answer:
IMDb's conditions of use forbid scraping, but here is what it would look like if you were allowed to scrape their site:
import scrapy

class imdb_project(scrapy.Spider):
    name = 'imdb'
    start_urls = ['https://www.imdb.com/chart/top']

    def parse(self, response):
        # Loop over each title cell rather than each link, so that the
        # title and its year can be read from the same element.
        for i in response.css('.titleColumn'):
            title = i.css('a::text').get()
            # The year text looks like "(1994)"; [1:-1] drops the parentheses.
            year = i.css('.secondaryInfo::text').get()[1:-1]
            if int(year) >= 2000:
                # Print just `title` instead if you don't want the year.
                print("{0} ({1})".format(title, year))
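If it helps to see the filtering step on its own, here is a minimal sketch of the same idea in plain Python, with no Scrapy and no regex. The helper names `parse_year` and `titles_from` and the sample rows are made up for illustration; the sketch assumes the year text arrives as a string like "(1994)", as in the answer above, and it skips rows where the year is missing instead of raising an error.

```python
def parse_year(year_text):
    """Turn a string like '(1994)' into the integer 1994.

    Returns None when the text does not contain a plain number.
    """
    cleaned = year_text.strip().strip("()")
    return int(cleaned) if cleaned.isdecimal() else None


def titles_from(movies, min_year=2000):
    """Return 'Title (year)' strings for movies released in min_year or later."""
    kept = []
    for title, year_text in movies:
        # Treat a missing year (None) as an empty string so it is skipped.
        year = parse_year(year_text or "")
        if year is not None and year >= min_year:
            kept.append("{0} ({1})".format(title, year))
    return kept


# Hypothetical sample rows shaped like (title, year text) pairs,
# standing in for what the spider reads from each table cell.
sample = [
    ("The Shawshank Redemption", "(1994)"),
    ("The Dark Knight", "(2008)"),
    ("The Godfather", "(1972)"),
    ("Inception", "(2010)"),
    ("Untitled", None),
]

for line in titles_from(sample):
    print(line)
```

Inside the spider, the same check could replace the bare `int(year)` call, so that a row without a year is skipped rather than stopping the crawl.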