I am currently learning data scraping using the BeautifulSoup package. At the moment, I am trying to get a list of the movie franchises from the Box Office Mojo website (https://www.boxofficemojo.com/franchise/?ref_=bo_nb_bns_secondarytab).
The main problem is that I can't seem to access or extract the data within the <main> tag. Below is the code I am using.
import requests
from bs4 import BeautifulSoup
listOfFranchiseLink = "https://www.boxofficemojo.com/franchise/?ref_=bo_nb_bns_secondarytab"
r = requests.get(listOfFranchiseLink)
soup = BeautifulSoup(r.content, 'html.parser')
s0 = soup.find('div', id='a-page')
s1 = s0.find(id='')
s2 = s1.find('div', id='a-section mojo-body aok-relative')
assert s1 is not None
assert s2 is not None
While the script does find something for 's1', it is not what I am expecting (an element that contains a div with the class "a-section mojo-body aok-relative" at the top). As a result, I am getting None for 's2'.
My questions are:
- What am I doing wrong? How can I extract data inside the <main> tag?
- I have a feeling creating a soup object for each layer is not very efficient. What is the more standard way to extract data buried within layers of different HTML tags?
Edit: Meant to write s0.find('main') instead of s0.find(id=''). But the former returned the same result as the latter, so it didn't really matter.
CodePudding user response:
It's because s2 is actually None, since s1 returns this:
<script data-a-state='{"key":"a-wlab-states"}' type="a-state">{}</script>
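so searching it for id='a-section mojo-body aok-relative' yields nothing. Hence the second assert fails. Note also that "a-section mojo-body aok-relative" is a list of classes, not an id, so even inside the right parent it would have to be matched by class rather than by id.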
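On the second question, a common way to avoid building one object per layer is to describe the whole path in a single CSS selector with select_one. The sketch below is only an illustration: the HTML string is a hypothetical stand-in that mimics the nesting described in the question, not markup copied from the real page, and "Example Franchise" is placeholder data.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the nesting described in the question;
# it is NOT copied from the real Box Office Mojo page.
html = """
<div id="a-page">
  <script type="a-state">{}</script>
  <main>
    <div class="a-section mojo-body aok-relative">
      <table><tr><td>Example Franchise</td></tr></table>
    </div>
  </main>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The string is a list of classes, so looking it up as an id matches nothing.
by_id = soup.find("div", id="a-section mojo-body aok-relative")

# Passing the exact class attribute string to class_ does match.
by_class = soup.find("div", class_="a-section mojo-body aok-relative")

# A single CSS selector walks every layer in one call.
by_css = soup.select_one("div#a-page main div.a-section.mojo-body.aok-relative")

print(by_id)
print(by_css.find("td").get_text())
```

Whether the same selector matches on the live page depends on the HTML that requests actually receives, so it is worth printing part of r.text first to check.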
If you want to scrape the table, you can go with just pandas and requests, like this:
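from io import StringIO

import pandas as pd
import requests

# read_html returns a list of DataFrames, one per <table>; [0] takes the first.
# Wrapping the text in StringIO avoids the deprecation warning that newer
# pandas versions emit when given a literal HTML string.
df = (
    pd.read_html(
        StringIO(
            requests.get(
                "https://www.boxofficemojo.com/franchise/?ref_=bo_nb_bns_secondarytab"
            ).text
        ),
        flavor="lxml",
    )[0]
)
print(df)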
To get this:
                           Franchise  ...  Lifetime Gross
0          Marvel Cinematic Universe  ...    $858,373,000
1                          Star Wars  ...    $936,662,225
2    Disney Live Action Reimaginings  ...    $543,638,043
3                         Spider-Man  ...    $804,789,334
4     J.K. Rowling's Wizarding World  ...    $381,011,219
..                               ...  ...             ...
287                 Ip Man Franchise  ...      $2,679,437
288                   Chal Mera Putt  ...        $644,000
289                           Shiloh  ...      $1,007,822
290                       Evangelion  ...        $174,945
291                            V/H/S  ...        $100,345

[292 rows x 5 columns]