Loop scrapes the same page 20 times instead of iterating through range

Time:11-01

I'm trying to scrape IMDb for a list of the top 1000 movies and get some details about them. However, when I run the code below, instead of fetching the first 50 movies and then moving to the next page for the next 50, it scrapes the same page on every iteration and adds the same 50 entries to my data 20 times.

import pandas as pd
import requests
from bs4 import BeautifulSoup as bsp

# Dataframe template
data = pd.DataFrame(columns=['ID','Title','Genre','Summary'])

#Get page data function
def getPageContent(start=1):
  start = 1
  url = 'https://www.imdb.com/search/title/?title_type=feature&year=1950-01-01,2019-12-31&sort=num_votes,desc&start=' + str(start)
  r = requests.get(url)
  bs = bsp(r.text, "lxml")
  return bs

#Run for top 1000
for start in range(1,1001,50):
  getPageContent(start)
  movies = bs.findAll("div", "lister-item-content")
  for movie in movies:
    id = movie.find("span", "lister-item-index").contents[0]
    title = movie.find('a').contents[0]
    genres = movie.find('span', 'genre').contents[0]
    genres = [g.strip() for g in genres.split(',')]
    summary = movie.find("p", "text-muted").find_next_sibling("p").contents
  
    i = data.shape[0]
    data.loc[i] = [id,title,genres,summary]

#Clean data
# data.ID = [float(re.sub('.','',str(i))) for i in data.ID] #remove . from ID

data.head(51)

     ID   Title                     Genre                         Summary
0    1.   The Shawshank Redemption  [Drama]                       [\nTwo imprisoned men bond over a number of ye...
1    2.   The Dark Knight           [Action, Crime, Drama]        [\nWhen the menace known as the Joker wreaks h...
2    3.   Inception                 [Action, Adventure, Sci-Fi]   [\nA thief who steals corporate secrets throug...
3    4.   Fight Club                [Drama]                       [\nAn insomniac office worker and a devil-may-...
...
46   47.  The Usual Suspects        [Crime, Drama, Mystery]       [\nA sole survivor tells of the twisty events ...
47   48.  The Truman Show           [Comedy, Drama]               [\nAn insurance salesman discovers his whole l...
48   49.  Avengers: Infinity War    [Action, Adventure, Sci-Fi]   [\nThe Avengers and their allies must be willi...
49   50.  Iron Man                  [Action, Adventure, Sci-Fi]   [\nAfter being held captive in an Afghan cave,...
50   1.   The Shawshank Redemption  [Drama]                       [\nTwo imprisoned men bond over a number of ye...

CodePudding user response:

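Delete the 'start = 1' line inside the 'getPageContent' function. It overwrites the argument on every call, so the URL always ends in 'start=1' and the same page is fetched each time. Also make sure the loop keeps the function's return value ('bs = getPageContent(start)'); otherwise 'bs' is never updated with the new page.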

#Get page data function
def getPageContent(start=1):
  url = 'https://www.imdb.com/search/title/?title_type=feature&year=1950-01-01,2019-12-31&sort=num_votes,desc&start=' + str(start)
  r = requests.get(url)
  bs = bsp(r.text, "lxml")
  return bs
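
To make the effect of that change easier to see, here is a minimal sketch of the paging logic on its own, with the network and parsing steps passed in as functions so it can be run without hitting IMDb. The names 'build_url', 'page_starts', 'scrape_all', 'fetch_page' and 'parse_movies' are hypothetical and not part of the original code; only the URL and the 'range(1, 1001, 50)' loop come from the question.

```python
BASE_URL = ('https://www.imdb.com/search/title/?title_type=feature'
            '&year=1950-01-01,2019-12-31&sort=num_votes,desc&start=')


def build_url(start=1):
    # `start` is NOT reassigned here, so the caller's value reaches the URL.
    return BASE_URL + str(start)


def page_starts(total=1000, page_size=50):
    # 1, 51, 101, ..., 951: the first item of each results page.
    return list(range(1, total + 1, page_size))


def scrape_all(fetch_page, parse_movies, total=1000, page_size=50):
    # fetch_page(url) and parse_movies(page) are supplied by the caller
    # (hypothetical hooks, used here so the loop can be tested offline).
    rows = []
    for start in page_starts(total, page_size):
        page = fetch_page(build_url(start))  # keep the returned page
        rows.extend(parse_movies(page))
    return rows
```

With the question's libraries, 'fetch_page' could be something like 'lambda url: bsp(requests.get(url).text, "lxml")', and 'parse_movies' would hold the body of the inner 'for movie in movies' loop. The point of the sketch is only that each iteration builds a different URL and uses the page that was actually returned.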

CodePudding user response:

I was not able to test this code. See inline comments for what I see as the main issue.

# Dataframe template
data = pd.DataFrame(columns=['ID', 'Title', 'Genre', 'Summary'])


# Get page data function
def getPageContent(start=1):
    url = 'https://www.imdb.com/search/title/?title_type=feature&year=1950-01-01,2019-12-31&sort=num_votes,desc&start=' + str(start)
    r = requests.get(url)
    bs = bsp(r.text, "lxml")
    return bs


# Run for top 1000
# for start in range(1, 1001, 50):  # 50 is a step value so this gets every 50th movie
# Try 2 loops
start = 0
for group in range(0, 1001, 50):
    for item in range(group, group + 50):
        bs = getPageContent(item)
        movies = bs.findAll("div", "lister-item-content")
        for movie in movies:
            id = movie.find("span", "lister-item-index").contents[0]
            title = movie.find('a').contents[0]
            genres = movie.find('span', 'genre').contents[0]
            genres = [g.strip() for g in genres.split(',')]
            summary = movie.find("p", "text-muted").find_next_sibling("p").contents
    
            i = data.shape[0]
            data.loc[i] = [id, title, genres, summary]

# Clean data
# data.ID = [float(re.sub('.','',str(i))) for i in data.ID] #remove . from ID

data.head(51)
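
Since this version could not be tested, a quick offline check of which 'start' values each loop shape produces may help when choosing between them. This is a sketch that assumes each results page lists 50 titles beginning at 'start', as the output in the question suggests; the helper names 'single_loop_starts' and 'nested_loop_starts' are hypothetical.

```python
def single_loop_starts(total=1000, page_size=50):
    # Values produced by: for start in range(1, 1001, 50)
    return list(range(1, total + 1, page_size))


def nested_loop_starts(total=1000, page_size=50):
    # Values produced by the two nested loops above.
    return [item
            for group in range(0, total + 1, page_size)
            for item in range(group, group + page_size)]


single = single_loop_starts()
nested = nested_loop_starts()
print(len(single), single[:3], single[-1])  # 20 [1, 51, 101] 951
print(len(nested), nested[:3], nested[-1])  # 1050 [0, 1, 2] 1049
```

Under that assumption, the single loop issues 20 requests whose pages do not overlap, while the nested version issues 1050 requests whose 50-title pages overlap heavily, so it would be expected to add many duplicate rows.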