As I said in the title, I'm scraping some information from Letterboxd and need help.
I already have a function that scrapes all the info I need (such as name, date, cast, etc.) from a URL like this: https://letterboxd.com/film/when-marnie-was-there/
I also want to scrape all the movies I've already watched (listed here: https://letterboxd.com/gfac/films/diary/) and then feed each of their URLs into my other function.
But looking at the page in my browser's devtools, I can't find the complete movie URL in my diary. So I was wondering whether I can extract one of the two pieces of info highlighted in the screenshot. If so, I could then concatenate
"https://letterboxd.com/" + "film/when-marnie-was-there/"
and run my other function.
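Something like this, where scrape_film is just a stand-in name for my existing function:

base = "https://letterboxd.com/"
path = "film/when-marnie-was-there/"  # the piece I hope to extract from the diary page
scrape_film(base + path)  # fetches https://letterboxd.com/film/when-marnie-was-there/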
This is what I have so far:

import requests
from bs4 import BeautifulSoup

def teste(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    elem = soup.find_all("h3", {"class": "headline-3 prettify"})[0]  # first diary entry headline
    return elem

a = teste("https://letterboxd.com/gfac/films/diary/")
print(a)

which prints:

<h3 class="headline-3 prettify"><a href="/gfac/film/when-marnie-was-there/">When Marnie Was There</a></h3>
CodePudding user response:
Using the diary link that you provided, the following code retrieves and builds a list of film URLs from that page. You will have to find a way to feed the code the second page of your diary, and the third or fourth as it grows (a sketch of one way to do that follows after the explanation below).
import requests
from bs4 import BeautifulSoup

url = 'https://letterboxd.com/gfac/films/diary'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
links = [link['href'] for link in soup.find_all('a', href=True)]  # href=True skips <a> tags without a link
films = [url[0:22] + i[5:] for i in list(dict.fromkeys(links)) if '/gfac/film/' in i]  # url[0:22] == 'https://letterboxd.com'
links is created by finding every <a> tag within soup and extracting each href attribute (the hyperlinks in the HTML). This results in a list containing many unwanted links and duplicates.
films is a list of film URLs, built by first removing duplicate links (dict.fromkeys preserves order while deduplicating), then filtering out anything that does not point to a film page (denoted by /film/ after the base URL). The slice i[5:] strips the leading /gfac from each string, and url[0:22] prepends the Letterboxd base URL, producing a proper link.
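A sketch of one way to handle pagination, assuming Letterboxd serves further diary pages at /page/2/, /page/3/, and so on (worth confirming in your browser) - the loop stops as soon as a page yields no film links:

import requests
from bs4 import BeautifulSoup

base = 'https://letterboxd.com'
films = []
page = 1
while True:
    r = requests.get(f'{base}/gfac/films/diary/page/{page}/')
    soup = BeautifulSoup(r.text, 'html.parser')
    links = [a['href'] for a in soup.find_all('a', href=True)]
    page_films = [base + i[5:] for i in dict.fromkeys(links) if '/gfac/film/' in i]
    if not page_films:  # an empty page means we have gone past the last one
        break
    films.extend(page_films)
    page += 1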
CodePudding user response:
You are on the right track - extract the href value with .get('href') and concatenate it with the base URL. To generate a list of URLs that you can iterate over while scraping, use a list comprehension:
diary_urls = ['https://letterboxd.com' + a.get('href').replace('/gfac','') for a in soup.select('h3>a[href]')]
Note: You could go with find_all(); I used select() and CSS selectors here for convenience and to target the elements more specifically - only <a> tags with an href attribute that are direct children of an <h3>.
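For comparison, a rough find_all() equivalent of that selector could look like this (just a sketch; recursive=False limits the search to direct children of each <h3>):

diary_urls = [
    'https://letterboxd.com' + a.get('href').replace('/gfac', '')
    for h3 in soup.find_all('h3')
    for a in h3.find_all('a', href=True, recursive=False)
]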
Example

import requests
from bs4 import BeautifulSoup

r = requests.get('https://letterboxd.com/gfac/films/diary/')
soup = BeautifulSoup(r.content, "html.parser")
diary_urls = ['https://letterboxd.com' + a.get('href').replace('/gfac','') for a in soup.select('h3>a[href]')]

data = []
for url in diary_urls[:2]:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    data.append({
        'title': soup.select_one('#film-page-wrapper h1').get_text(),
        'cast': soup.select_one('#tab-cast p').get_text(',', strip=True),
        'what ever': 'you like to scrape'
    })
data
Output
[{'title': 'When Marnie Was There', 'cast': 'Sara Takatsuki,Kasumi Arimura,Nanako Matsushima,Susumu Terajima,Toshie Negishi,Ryôko Moriyama,Kazuko Yoshiyuki,Hitomi Kuroki,Hiroyuki Morisaki,Takuma Otoo,Hana Sugisaki,Bari Suzuki,Shigeyuki Totsugi,Ken Yasuda,Yo Oizumi,Yuko Kaida', 'what ever': 'you like to scrape'}, {'title': 'The Fly', 'cast': 'Jeff Goldblum,Geena Davis,John Getz,Joy Boushel,Leslie Carlson,George Chuvalo,Michael Copeman,David Cronenberg,Carol Lazare,Shawn Hewitt,Typhoon', 'what ever': 'you like to scrape'},...]
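To hook this into your setup, you could replace the dict-building loop body with a call to your existing per-film function (scrape_film below is just a stand-in name for it), with a small pause between requests to be polite to the server:

import time

results = []
for url in diary_urls:
    results.append(scrape_film(url))  # scrape_film = your existing detail scraper (stand-in name)
    time.sleep(1)  # brief delay between requests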