The rating list is at: https://www.imdb.com/chart/top

I need to extract all 250 URLs from the page above and save them to a txt file, with the output looking like this:

Then I need to print the number of movies, followed by the ratings of the first 6 movies, like this:

I previously used Beautiful Soup 4 for the extraction, but the task requires using only regular expressions to extract the 250 URLs so that the output looks like image 1, and that is where I am a little stuck. I can also use loops, but not any built-in function for removing duplicates.

Thank you.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import numpy as np

url = 'https://www.imdb.com/chart/top'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

count = 0
all_urls = list()

for tdtag in soup.find_all(class_ = "titleColumn"):
    url = tdtag.a['href']
    all_urls.append(url)
    count += 1

print('total of {} urls'.format(count))

data = np.array(all_urls)
print(data)

np.savetxt('urls.txt', data, fmt = '%s', encoding = 'utf-8')

CodePudding user response：

I'm not sure if this is exactly what you want, but I would be happy to help if you need any clarification.

You don't really need Beautiful Soup to scrape the rating text:

X.X based on X,X,X user ratings

Just use

import requests
data = requests.get('https://www.imdb.com/chart/top').text.split('\n')

To get the data

And then, you can use

rating = [i.split('"')[1] for i in data if ' user ratings">' in i]
with open("outfile", "w") as outfile:
    outfile.write("\n".join(str(item) for item in rating))

To get your result

The result is written to a file called outfile in the current working directory.
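To see why the split('"')[1] step picks out the rating text, here is a minimal sketch on a single line. The sample line is an assumption about how the rating markup looked (a strong tag whose title attribute holds the sentence), not a copy of the live page:

```python
# Hypothetical line of page source; the real markup may differ.
sample_line = '<strong title="9.2 based on 2,460,328 user ratings">9.2</strong>'

# The same filter as in the list comprehension above.
if ' user ratings">' in sample_line:
    # Splitting on double quotes yields three pieces; the piece at
    # index 1 is whatever sat inside the first pair of quotes.
    parts = sample_line.split('"')
    rating = parts[1]
    print(rating)
```

This only works while the rating sentence is the first quoted attribute on its line, so it is more fragile than a regular expression anchored on the text itself.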

CodePudding user response：

Edit: (using regex)

import re
import requests

data = requests.get('https://www.imdb.com/chart/top').text
# raw strings keep backslashes such as \w and \d intact for the regex engine
titles = re.findall(r'/title/\w*/(?=">)', data)
ratings = re.findall(r'\d\.\d.*ratings', data)

This stores all the title links in titles and the rating sentences in ratings. You can then print the first six, for example:

for i in range(6):
    print(f'No.{i + 1}: {ratings[i]} (Link: {titles[i]})')

which outputs:

No.1: 9.2 based on 2,460,328 user ratings (Link: /title/tt0111161/)
No.2: 9.1 based on 1,701,913 user ratings (Link: /title/tt0068646/)
No.3: 9.0 based on 1,182,111 user ratings (Link: /title/tt0071562/)
No.4: 9.0 based on 2,415,762 user ratings (Link: /title/tt0468569/)
No.5: 8.9 based on 728,394 user ratings (Link: /title/tt0050083/)
No.6: 8.9 based on 1,264,883 user ratings (Link: /title/tt0108052/)
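If the regex returns the same link more than once, the duplicates can be dropped with plain loops, since the question rules out built-in de-duplication helpers. The following is a rough sketch rather than a tested scraper: sample_html is a made-up stand-in for the downloaded page in which every link appears twice, and the names found and unique_urls are illustrative only.

```python
import re

# Hypothetical stand-in for the page source; every link appears twice.
sample_html = '''
<td class="posterColumn"><a href="/title/tt0111161/"><img></a></td>
<td class="titleColumn"><a href="/title/tt0111161/">The Shawshank Redemption</a></td>
<td class="posterColumn"><a href="/title/tt0068646/"><img></a></td>
<td class="titleColumn"><a href="/title/tt0068646/">The Godfather</a></td>
'''

# All title links, duplicates included, in page order.
found = re.findall(r'/title/\w+/(?=">)', sample_html)

# Keep a link only if an inner loop does not find it among those already kept.
unique_urls = []
for link in found:
    already_kept = False
    for kept in unique_urls:
        if kept == link:
            already_kept = True
            break
    if not already_kept:
        unique_urls.append(link)

print('total of {} urls'.format(len(unique_urls)))
for link in unique_urls:
    print(link)
```

The nested loop keeps the original page order, and for 250 links the quadratic cost is negligible. The resulting list can then be written to the txt file one link per line.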

Old answer:

I do not quite see why you need regex, but if you want to get the text

X.X based on X,X,X user ratings

for a given title link, e.g.

/title/tt0068646/

You can do:

title = '/title/tt0068646/'
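# find the first link in soup (from the question's code) that points to title
titlelink = soup.find(href = title)

# go up from the link to its table row, then read the rating from the row's <strong> tag
print(titlelink.parent.parent.strong['title'])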

which outputs:

9.1 based on 1,701,913 user ratings