How to scrape information from a Letterboxd diary?


As I said in the title, I'm scraping some information on Letterboxd and need help.

I already have a function that scrapes all the info I need (such as name, date, cast, etc.) from a URL like this: https://letterboxd.com/film/when-marnie-was-there/

The point is that I also want to scrape all the movies I've already watched (which you can find here https://letterboxd.com/gfac/films/diary/) and after that use their URL to run my other function.

But looking into the devtools in my browser, I can't find the complete movie URL in my diary. So I was wondering whether I can extract one of the two pieces of info highlighted in the screenshot. If so, I can then concatenate

"https://letterboxd.com/"   "film/when-marnie-was-there/" 

and run my other function.
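That concatenation step can be sketched with the standard library's urljoin, assuming the href always carries the /gfac user prefix as in the screenshot:

```python
from urllib.parse import urljoin

base = "https://letterboxd.com/"
href = "/gfac/film/when-marnie-was-there/"  # example href from the diary page

# Drop the "/gfac" user prefix once so the path points at the film page itself
film_path = href.replace("/gfac", "", 1)
film_url = urljoin(base, film_path)
print(film_url)  # https://letterboxd.com/film/when-marnie-was-there/
```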

This is what I got until now:

import requests
from bs4 import BeautifulSoup

def teste(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    elem = soup.find_all("h3", {"class": "headline-3 prettify"})[0]
    return elem

a = teste("https://letterboxd.com/gfac/films/diary/")
print(a)

<h3 class="headline-3 prettify"><a href="/gfac/film/when-marnie-was-there/">When Marnie Was There</a></h3>

CodePudding user response:

Using the diary link that you provided, the following code retrieves and builds a list of film URLs from that page. You will still have to find a way to feed the code the second page of your diary, and the third or fourth as it grows.

import requests
from bs4 import BeautifulSoup

url = 'https://letterboxd.com/gfac/films/diary'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

links = [link['href'] for link in soup.find_all('a')]
films = [url[0:22] + i[5:] for i in list(dict.fromkeys(links)) if '/gfac/film/' in i]

links is created by finding every <a> tag within soup and extracting each href attribute (the hyperlinks in the HTML). This results in a list containing many unwanted links and duplicates.

films is the list of film URLs: dict.fromkeys() first drops duplicate links, the if clause keeps only hrefs that point at a film page (denoted by /film/ after the /gfac/ prefix), i[5:] strips the leading /gfac from each string, and finally the Letterboxd base URL (url[0:22]) is prepended to each item, creating a proper link.
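For the pagination caveat above, a minimal sketch of building the per-page URLs, assuming later diary pages follow the usual Letterboxd pattern of .../diary/page/N/ (an assumption worth verifying in your browser):

```python
# Hypothetical page-URL builder; the bare diary URL serves as page 1.
base = 'https://letterboxd.com/gfac/films/diary/'
max_pages = 3  # placeholder; in practice, loop until a page yields no film links

page_urls = [base] + [f'{base}page/{n}/' for n in range(2, max_pages + 1)]
print(page_urls)
```

You would then run the scraping code once per entry in page_urls and extend the films list with each page's results.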

CodePudding user response:

You are on the right track - extract the href value with .get('href') and concatenate it with the base URL. To generate a list of URLs that you can iterate over for scraping, use a list comprehension:

diary_urls = ['https://letterboxd.com' + a.get('href').replace('/gfac','') for a in soup.select('h3>a[href]')]

Note: You could go with find_all(); I used select() and CSS selectors here for convenience and to select the elements more specifically - only <a> elements with an href attribute that are direct children of an <h3>.
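To illustrate the equivalence, here is a small sketch showing the CSS selector and a find_all() version picking out the same href, run against a one-line HTML sample modeled on the diary markup:

```python
from bs4 import BeautifulSoup

html = '<h3 class="headline-3 prettify"><a href="/gfac/film/when-marnie-was-there/">When Marnie Was There</a></h3>'
soup = BeautifulSoup(html, 'html.parser')

# select(): <a> with an href attribute that is a direct child of an <h3>
via_select = [a['href'] for a in soup.select('h3>a[href]')]

# find_all() equivalent: recursive=False restricts the search to direct children
via_find_all = [a['href']
                for h3 in soup.find_all('h3')
                for a in h3.find_all('a', href=True, recursive=False)]

print(via_select == via_find_all)  # True
```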

Example

import requests
from bs4 import BeautifulSoup

r = requests.get('https://letterboxd.com/gfac/films/diary/')
soup = BeautifulSoup(r.content, "html.parser")

diary_urls = ['https://letterboxd.com' + a.get('href').replace('/gfac','') for a in soup.select('h3>a[href]')]

data = []
for url in diary_urls[:2]:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    data.append({
        'title': soup.select_one('#film-page-wrapper h1').get_text(),
        'cast':soup.select_one('#tab-cast p').get_text(',',strip=True),
        'what ever':'you like to scrape'
    })
    
data

Output

[{'title': 'When Marnie Was There', 'cast': 'Sara Takatsuki,Kasumi Arimura,Nanako Matsushima,Susumu Terajima,Toshie Negishi,Ryôko Moriyama,Kazuko Yoshiyuki,Hitomi Kuroki,Hiroyuki Morisaki,Takuma Otoo,Hana Sugisaki,Bari Suzuki,Shigeyuki Totsugi,Ken Yasuda,Yo Oizumi,Yuko Kaida', 'what ever': 'you like to scrape'}, {'title': 'The Fly', 'cast': 'Jeff Goldblum,Geena Davis,John Getz,Joy Boushel,Leslie Carlson,George Chuvalo,Michael Copeman,David Cronenberg,Carol Lazare,Shawn Hewitt,Typhoon', 'what ever': 'you like to scrape'},...]