I'm trying to learn how to scrape components from a website, specifically this one: https://genshin-impact.fandom.com/wiki/Serenitea_Pot/Load
Following guidance from the internet, I collected several important pieces of information, such as the class
"article-table sortable mw-collapsible jquery-tablesorter mw-made-collapsible"
and HTML elements like th and td, and tried to get their content using this code:
import requests
from bs4 import BeautifulSoup

URL = "https://genshin-impact.fandom.com/wiki/Serenitea_Pot/Load"
page = requests.get(URL)
#print(page.text)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="mw-content-text")
teapot_loads = results.find_all("table", class_="article-table sortable mw-collapsible jquery-tablesorter mw-made-collapsible")
# Use a separate loop variable so the list is not shadowed while iterating
for teapot_load in teapot_loads:
    table_head_element = teapot_load.find("th", class_="headerSort")
    print(table_head_element)
    print()
I seem to have used the correct element (th) and the correct class name, "headerSort", but the program prints nothing, and it raises no error either. What did I do wrong?
CodePudding user response:
You can debug your code to see where it goes wrong. One way to do that is shown below: search for tables using only one class, then print the full class list of each table that is actually found:
import requests
from bs4 import BeautifulSoup

URL = "https://genshin-impact.fandom.com/wiki/Serenitea_Pot/Load"
page = requests.get(URL)
#print(page.text)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="mw-content-text")
# print(results)
teapot_loads = results.find_all("table", class_="article-table")
for teapot_load in teapot_loads:
    print(teapot_load.get_attribute_list('class'))
    table_head_element = teapot_load.find("th", class_="headerSort")
    print(table_head_element)
This will print out (besides the element you want printed out) the table's classes as seen by requests/BeautifulSoup: ['article-table', 'sortable', 'mw-collapsible']. After the original HTML loads in the browser (with the original classes, which are what requests/BeautifulSoup see), the JavaScript on that page kicks in and adds new classes to the table. Because you are searching for elements by those dynamically added classes, your search fails.
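To make the difference concrete, here is a minimal sketch on a made-up HTML snippet. The markup and the "Example item" row are assumptions standing in for what requests receives, not the actual wiki page. It illustrates that a multi-class string passed to class_ is compared against the exact value of the class attribute, whereas a single class or a CSS selector only requires those classes to be present:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the table as served, before any JavaScript adds classes.
html = """
<div id="mw-content-text">
  <table class="article-table sortable mw-collapsible">
    <tr><th>Name</th><th>Load</th></tr>
    <tr><td>Example item</td><td>35</td></tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The class string copied from the browser includes classes added by JavaScript,
# so it does not equal the served class attribute and nothing is found.
browser_classes = "article-table sortable mw-collapsible jquery-tablesorter mw-made-collapsible"
no_match = soup.find_all("table", class_=browser_classes)

# Matching on one class, or on several via a CSS selector, only requires
# that those classes appear among the element's classes.
by_one_class = soup.find_all("table", class_="article-table")
by_selector = soup.select("table.article-table.sortable")

headers = [th.get_text(strip=True) for th in by_selector[0].find_all("th")]
print(len(no_match), len(by_one_class), len(by_selector), headers)
```

The same reasoning applies to any class you see only in the browser's inspector: if the printed class list from requests/BeautifulSoup does not contain it, searching by it will find nothing.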
Nonetheless, here is a more elegant way of obtaining that table:
import pandas as pd
url = 'https://genshin-impact.fandom.com/wiki/Serenitea_Pot/Load'
dfs = pd.read_html(url)
print(dfs[1])
This will return a dataframe with that table:
| | Image | Name | Adeptal Energy | Load | ReducedLoad | Ratio |
|---|---|---|---|---|---|---|
| 0 | nan | "A Bloatty Floatty's Dream of the Sky" | 60 | 65 | 47 | 0.92 |
| 1 | nan | "A Guide in the Summer Woods" | 60 | 35 | 24 | 1.71 |
| 2 | nan | "A Messenger in the Summer Woods" | 60 | 35 | 24 | 1.71 |
| 3 | nan | "A Portrait of Paimon, the Greatest Companion" | 90 | 35 | 24 | 2.57 |
| 4 | nan | "A Seat in the Wilderness" | 20 | 50 | 50 | 0.4 |
| 5 | nan | "Ballad-Spinning Windwheel" | 90 | 185 | 185 | 0.49 |
| 6 | nan | "Between Nine Steps" | 30 | 550 | 550 | 0.05 |
[...]
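Picking the table by position (dfs[1]) can break if the page gains or loses a table. As a hedged alternative, pandas.read_html accepts a match argument that keeps only tables containing the given text. The sketch below runs on a made-up HTML string (the rows and values are hypothetical, not taken from the wiki) and assumes an HTML parser that pandas supports, such as lxml, is installed:

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second has the column we care about.
html = """
<table><tr><th>Menu</th></tr><tr><td>Home</td></tr></table>
<table class="article-table">
  <tr><th>Name</th><th>Adeptal Energy</th><th>Load</th></tr>
  <tr><td>Example item A</td><td>60</td><td>35</td></tr>
  <tr><td>Example item B</td><td>90</td><td>185</td></tr>
</table>
"""

# match keeps only the tables whose text matches the given string or regex.
tables = pd.read_html(StringIO(html), match="Adeptal Energy")
df = tables[0]
print(df)
```

With the real page, the same idea would be pd.read_html(url, match="Adeptal Energy"), though whether that returns exactly one table depends on the page's current content.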
Documentation for bs4 (BeautifulSoup) can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#
Also, docs for pandas.read_html: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html