The rating list website is: https://www.imdb.com/chart/top
Then print the movie number along with the rating for the first 6 movies, like this:
I previously used Beautiful Soup 4 to extract the URLs, but the task now asks for using only regular expressions to extract the 250 URLs so that they look like image 1, and I am a little stuck. I can also use loops, but not any built-in function for removing duplicates.
Thank you.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import numpy as np

url = 'https://www.imdb.com/chart/top'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
count = 0
all_urls = list()
# collect the link from each title cell
for tdtag in soup.find_all(class_="titleColumn"):
    url = tdtag.a['href']
    all_urls.append(url)
    count += 1
print('total of {} urls'.format(count))
data = np.array(all_urls)
print(data)
np.savetxt('urls.txt', data, fmt='%s', encoding='utf-8')
CodePudding user response:
I'm not sure if this is exactly what you want but I would be happy to help you if you need any clarification.
You don't really need Beautiful Soup to scrape
X.X based on X,X,X user ratings
Just use:
import requests
data = requests.get('https://www.imdb.com/chart/top').text.split('\n')
to get the data, split into lines.
Then you can use:
rating = [i.split('"')[1] for i in data if ' user ratings">' in i]
with open("outfile", "w") as outfile:
    outfile.write("\n".join(str(item) for item in rating))
to get your result.
You can find it in a file called outfile in the same directory.
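To see why split('"')[1] picks out the rating sentence, here is a minimal sketch on a single line. The sample line is hypothetical: it assumes the sentence sits in a title attribute that is the first quoted value on its line, which may not match the live page.

```python
# Hypothetical line of page source; the real markup may differ
line = '<strong title="9.2 based on 2,460,328 user ratings">9.2</strong>'

# Splitting on double quotes puts the first attribute value at index 1
parts = line.split('"')
rating = parts[1]
print(rating)
```

If another quoted attribute came before title on the same line, index 1 would return that attribute instead, so this approach depends on the exact markup.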
CodePudding user response:
Edit: (using regex)
import re
import requests
data = requests.get('https://www.imdb.com/chart/top').text
# raw strings keep the backslashes in the patterns intact
titles = re.findall(r'/title/\w*/(?=">)', data)
ratings = re.findall(r'\d\.\d.*ratings', data)
This saves all the title links into titles and the rating sentences into ratings. You can then print the first six with:
for i in range(6):
    print(f'No.{i + 1}: {ratings[i]} (Link: {titles[i]})')
which outputs:
No.1: 9.2 based on 2,460,328 user ratings (Link: /title/tt0111161/)
No.2: 9.1 based on 1,701,913 user ratings (Link: /title/tt0068646/)
No.3: 9.0 based on 1,182,111 user ratings (Link: /title/tt0071562/)
No.4: 9.0 based on 2,415,762 user ratings (Link: /title/tt0468569/)
No.5: 8.9 based on 728,394 user ratings (Link: /title/tt0050083/)
No.6: 8.9 based on 1,264,883 user ratings (Link: /title/tt0108052/)
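If a looser pattern returns the same link more than once (for example when a poster and a title both link to the same movie), the duplicates can be dropped with plain loops, as the question requires. This is only a sketch: sample_html and unique_title_links are made-up names, and the fragment is a hypothetical stand-in for the downloaded page, not the real IMDb markup.

```python
import re

# Hypothetical HTML fragment standing in for the downloaded page
sample_html = (
    '<td class="posterColumn"><a href="/title/tt0111161/"><img></a></td>'
    '<td class="titleColumn"><a href="/title/tt0111161/">The Shawshank Redemption</a></td>'
    '<td class="posterColumn"><a href="/title/tt0068646/"><img></a></td>'
    '<td class="titleColumn"><a href="/title/tt0068646/">The Godfather</a></td>'
)

def unique_title_links(html):
    """Return title links in first-seen order, deduplicated with loops only."""
    found = re.findall(r'/title/tt\d+/', html)
    unique = []
    for link in found:
        seen = False
        # compare against every link kept so far instead of using set()
        for kept in unique:
            if kept == link:
                seen = True
                break
        if not seen:
            unique.append(link)
    return unique

links = unique_title_links(sample_html)
print('total of {} urls'.format(len(links)))
```

The nested loop is quadratic, which should be fine for about 250 links, and it keeps the chart order that a set would lose.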
Old answer:
I could not quite get why you need to use regex, but if you want to get the text
X.X based on X,X,X user ratings
for a given title link e.g.
/title/tt0068646/
You can do:
title = '/title/tt0068646/'
# find the first link in soup that points to title
titlelink = soup.find(href=title)
# the rating sentence is the title attribute of the <strong> tag in the same row
print(titlelink.parent.parent.strong['title'])
which outputs:
9.1 based on 1,701,913 user ratings