Python: Extracting HTML <main> Data Using BeautifulSoup

Time:06-23

I am currently learning data scraping using the BeautifulSoup package. At the moment, I am trying to get a list of the movie franchises from the Box Office Mojo website (https://www.boxofficemojo.com/franchise/?ref_=bo_nb_bns_secondarytab).

The main problem is that I can't seem to access or extract the data within the <main> tag. Below is the code I am using.

import requests
from bs4 import BeautifulSoup

listOfFranchiseLink = "https://www.boxofficemojo.com/franchise/?ref_=bo_nb_bns_secondarytab"

r = requests.get(listOfFranchiseLink)
soup = BeautifulSoup(r.content, 'html.parser')

s0 = soup.find('div', id='a-page')
s1 = s0.find(id='')
s2 = s1.find('div', id='a-section mojo-body aok-relative')

assert s1 is not None
assert s2 is not None

While the script does find something for 's1', it is not what I expect: the result should start with a div that has the class "a-section mojo-body aok-relative". As a result, I am getting None for 's2'.

My questions are:

  1. What am I doing wrong? How can I extract data inside the <main> tag?
  2. I have a feeling creating a soup object for each layer is not very efficient. What is the more standard way to extract data buried within layers of different HTML tags?

Edit: Meant to write s0.find('main') instead of s0.find(id=''). But the former returned the same result as the latter, so it didn't really matter.

CodePudding user response:

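s2 is None because s1 is not the <main> element. s1 is this tag:

<script data-a-state='{"key":"a-wlab-states"}' type="a-state">{}</script>

Searching inside that tag for id='a-section mojo-body aok-relative' finds nothing, so the second assert fails. Note also that "a-section mojo-body aok-relative" is a list of classes, not an id, so it would have to be matched with class_ rather than id.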
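As for the second question, a common way to avoid one find() call per layer is a single CSS selector. The sketch below illustrates that, along with the id versus class_ difference. It runs against a small, made-up HTML string (SAMPLE_HTML is a hypothetical stand-in, not the real Box Office Mojo markup), so the selector would likely need adjusting for the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the structure described in the question.
SAMPLE_HTML = """
<div id="a-page">
  <script type="a-state" data-a-state='{"key":"a-wlab-states"}'>{}</script>
  <main>
    <div class="a-section mojo-body aok-relative">
      <table>
        <tr><th>Franchise</th></tr>
        <tr><td>Star Wars</td></tr>
        <tr><td>Spider-Man</td></tr>
      </table>
    </div>
  </main>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The div has no id attribute at all, so an id= lookup finds nothing.
by_id = soup.find("div", id="a-section mojo-body aok-relative")

# class_ with the full string matches the class attribute's exact value.
by_class = soup.find("div", class_="a-section mojo-body aok-relative")

# One CSS selector replaces the chain of find() calls; class order does not matter here.
body = soup.select_one("main div.a-section.mojo-body")
franchises = [td.get_text(strip=True) for td in body.select("td")]

print(by_id)
print(by_class is not None)
print(franchises)
```

select_one returns None when nothing matches, so on a real page it is worth checking the result before calling methods on it.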

If you want to scrape the table, you can go with just pandas and requests, like this:

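from io import StringIO

import pandas as pd
import requests

url = "https://www.boxofficemojo.com/franchise/?ref_=bo_nb_bns_secondarytab"
html = requests.get(url).text

# read_html returns one DataFrame per <table> on the page; take the first.
# Wrapping the text in StringIO avoids the deprecation warning that newer
# pandas versions emit when given a literal HTML string.
df = pd.read_html(StringIO(html), flavor="lxml")[0]
print(df)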

To get this:

                           Franchise  ... Lifetime Gross
0          Marvel Cinematic Universe  ...   $858,373,000
1                          Star Wars  ...   $936,662,225
2    Disney Live Action Reimaginings  ...   $543,638,043
3                         Spider-Man  ...   $804,789,334
4     J.K. Rowling's Wizarding World  ...   $381,011,219
..                               ...  ...            ...
287                 Ip Man Franchise  ...     $2,679,437
288                   Chal Mera Putt  ...       $644,000
289                           Shiloh  ...     $1,007,822
290                       Evangelion  ...       $174,945
291                            V/H/S  ...       $100,345

[292 rows x 5 columns]