I am trying to extract a table from a webpage, but it is not working. My code is:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ismworld.org/supply-management-news-and-reports/reports/ism-report-on-business/pmi/august/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'}
data = requests.get(url, headers).text
soup = BeautifulSoup(data, 'html.parser')
t = soup.find("table", {"class": "table table-bordered table-hover table-responsive mb-4"})
print(t)
When I print t, I get None. What is wrong with the code?
Thanks!
CodePudding user response:
Make your life easier and give pandas a try.
To get all the tables, try this:
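import requests
import pandas as pd
from io import StringIO
url = 'https://www.ismworld.org/supply-management-news-and-reports/reports/ism-report-on-business/pmi/august/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
}
html = requests.get(url, headers=headers).text
# read_html returns a list of DataFrames, one per <table> found in the page.
# Wrapping the text in StringIO avoids the warning recent pandas versions give for raw HTML strings.
dfs = pd.read_html(StringIO(html), flavor='bs4')
print(dfs[0])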
Sample output of the first table:
                     Index  ...  Trend* (Months)
0       Manufacturing PMI®  ...               27
1               New Orders  ...                1
2               Production  ...               27
3               Employment  ...                1
4      Supplier Deliveries  ...               78
5              Inventories  ...               13
6   Customers’ Inventories  ...               71
7                   Prices  ...               27
8        Backlog of Orders  ...               26
9        New Export Orders  ...                1
10                 Imports  ...                3
11         OVERALL ECONOMY  ...               27
12    Manufacturing Sector  ...               27
[13 rows x 7 columns]
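One detail may explain why this snippet works while the code in the question does not: here the dictionary is passed as headers=headers. In requests.get(url, headers) the second positional argument is params, so the dictionary ends up in the query string and the custom User-Agent is never sent as a header. The sketch below checks this offline by preparing requests without sending them; the example.com URL and the agent string are placeholders, not values from the question.

```python
import inspect
import requests

# Placeholder values for illustration only
url = "https://example.com/report/"
headers = {"User-Agent": "my-browser-like-agent"}

# The second positional parameter of requests.get is `params`, not `headers`
first_two = list(inspect.signature(requests.get).parameters)[:2]
print(first_two)  # ['url', 'params']

# Equivalent of requests.get(url, headers): the dict becomes a query string
positional = requests.Request("GET", url, params=headers).prepare()
# Equivalent of requests.get(url, headers=headers): the dict becomes headers
keyword = requests.Request("GET", url, headers=headers).prepare()

print(positional.url)                        # User-Agent appears in the URL
print(positional.headers.get("User-Agent"))  # None: no custom header set
print(keyword.headers.get("User-Agent"))     # my-browser-like-agent
```

When such a request is actually sent, requests fills in its own default python-requests User-Agent, which a site can easily recognise as automated.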
CodePudding user response:
The problem is not with your code, but with the website response. Try adding the following code snippet after you send the request:
# Save the raw response so it can be inspected in a browser or editor
with open("ismworld.html", "w", encoding="utf-8") as file:
    file.write(data)
Then check the contents of the saved file. You'll notice that the response from the website doesn't contain a "table" element in the first place, most likely because the website detected your request as automated and blocked it.
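The same check can be done without opening the file by hand. This minimal sketch uses only the standard library to list the class attribute of every table in the response text. The two sample pages are hypothetical stand-ins for what data might contain, not actual responses from the site, and TableFinder and find_tables are names made up for this example.

```python
from html.parser import HTMLParser


class TableFinder(HTMLParser):
    """Record the class attribute of every <table> tag in a document."""

    def __init__(self):
        super().__init__()
        self.table_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # attrs is a list of (name, value) pairs; class may be absent
            self.table_classes.append(dict(attrs).get("class") or "")


def find_tables(html):
    """Return one class string per <table> found in the given HTML text."""
    finder = TableFinder()
    finder.feed(html)
    return finder.table_classes


# Hypothetical stand-ins for the text returned by the request
blocked_page = "<html><body><h1>Access denied</h1></body></html>"
normal_page = (
    '<html><body><table class="table table-bordered mb-4">'
    "<tr><td>52.8</td></tr></table></body></html>"
)

print(find_tables(blocked_page))  # []
print(find_tables(normal_page))   # ['table table-bordered mb-4']
```

An empty list for the real response would confirm that soup.find returns None because the table is missing from the HTML, not because the selector is wrong.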
There are several ways to avoid this (User-Agent randomization, IP rotation, sending the requests through a real browser, etc.).
But if you would rather focus on handling the data than on the scraping implementation itself, you could also try WebScrapingAPI. By default the service handles these detection issues, and its extract_rules feature returns elements in JSON format, based on the CSS selector you specify.
Here is a Python example adjusted for your case:
import requests
import json
site = "https://www.ismworld.org/supply-management-news-and-reports/reports/ism-report-on-business/pmi/august/"
url = "https://api.webscrapingapi.com/v1"
extract_rules = {
    "table": {
        "selector": "table.table.table-bordered.table-hover.table-responsive.mb-4",
        "output": "html"
    }
}
params = {
    "api_key": "YOUR_API_KEY",
    "url": site,
    "render_js": "1",
    "extract_rules": json.dumps(extract_rules)
}
response = requests.get(url, params=params)
print(response.text)
And the response:
{"table":["<table class=\"table table-bordered table-hover table-responsive mb-4\">\n<thead>\n<tr>\n<th
class=\"text-center\" scope=\"col\">Index</th>\n<th class=\"text-center\" scope=\"col\">Series Index Aug
</th>\n<th class=\"text-center\" scope=\"col\">Series Index Jul</th>\n<th class=\"text-center\"
scope=\"col\">Percentage Point Change</th>\n<th class=\"text-center\" scope=\"col\">Direction</th>\n<th
class=\"text-center\" scope=\"col\">Rate of Change</th>\n<th class=\"text-center\" scope=\"col\">Trend*
(Months)</th>\n</tr>\n</thead>\n<tbody>\n<tr>
<!-- Table#-Row#-Column# -->\n<th scope=\"row\">Manufacturing PMI<sup>®</sup></th>\n<td
class=\"text-center\">52.8</td>\n<td class=\"text-center\">52.8</td>\n<td class=\"text-center\">0.0</td>
\n<td class=\"text-center\">Growing</td>\n<td class=\"text-center\">Same</td>\n<td class=\"text-center\">27
</td>\n
</tr>\n<tr>
<!-- Table#-Row#-Column# -->\n<th scope=\"row\">New Orders</th>\n<td class=\"text-center\">51.3</td>\n<td
class=\"text-center\">48.0</td>\n<td class=\"text-center\"> 3.3</td>\n<td class=\"text-center\">Growing
</td>\n<td class=\"text-center\">From Contracting</td>\n<td class=\"text-center\">1</td>\n
</tr>\n<tr>
<!-- Table#-Row#-Column# -->\n<th scope=\"row\">Production</th>\n<td class=\"text-center\">50.4</td>\n<td
class=\"text-center\">53.5</td>\n<td class=\"text-center\">-3.1</td>\n<td class=\"text-center\">Growing
</td>\n<td class=\"text-center\">Slower</td>\n<td class=\"text-center\">27</td>\n
</tr>\n<tr>
<!-- Table#-Row#-Column# -->\n<th scope=\"row\">Employment</th>\n<td class=\"text-center\">54.2</td>\n<td
class=\"text-center\">49.9</td>\n<td class=\"text-center\"> 4.3</td>\n<td class=\"text-center\">Growing
</td>\n<td class=\"text-center\">From Contracting</td>\n<td class=\"text-center\">1</td>\n
</tr>\n<tr>
<!-- Table#-Row#-Column# -->\n<th scope=\"row\">Supplier Deliveries</th>\n<td class=\"text-center\">55.1
</td>\n<td class=\"text-center\">55.2</td>\n<td class=\"text-center\">-0.1</td>\n<td class=\"text-center\">
Slowing</td>\n<td class=\"text-center\">Slower</td>\n<td class=\"text-center\">78</td>\n
</tr>\n<tr>
<!-- Table#-Row#-Column# -->\n<th scope=\"row\">Inventories</th>\n<td class=\"text-center\">53.1</td>\n<td
class=\"text-center\">57.3</td>\n<td class=\"text-center\">-4.2</td>\n<td class=\"text-center\">Growing
</td>\n<td class=\"text-center\">Slower</td>\n<td class=\"text-center\">13</td>\n
</tr>\n<tr>
<!-- Table#-Row#-Column# -->\n<th scope=\"row\">Customers’ Inventories</th>\n<td class=\"text-center\">38.9
</td>\n<td class=\"text-center\">39.5</td>\n<td class=\"text-center\">-0.6</td>\n<td class=\"text-center\">
Too Low</td>\n<td class=\"text-center\">Faster</td>\n<td class=\"text-center\">71</td>\n
</tr>\n<tr>
<!-- Table#-Row#-Column# -->\n<th scope=\"row\">Prices</th>\n<td class=\"text-center\">52.5</td>\n<td
class=\"text-center\">60.0</td>\n<td class=\"text-center\">-7.5</td>\n<td class=\"text-center\">
Increasing</td>\n<td class=\"text-center\">Slower</td>\n<td class=\"text-center\">27</td>\n
</tr>\n<tr>
<!-- Table#-Row#-Column# -->\n<th scope=\"row\">Backlog of Orders</th>\n<td class=\"text-center\">53.0</td>
\n<td class=\"text-center\">51.3</td>\n<td class=\"text-center\"> 1.7</td>\n<td class=\"text-center\">
Growing</td>\n<td class=\"text-center\">Faster</td>\n<td class=\"text-center\">26</td>\n
</tr>\n<tr>
<!-- Table#-Row#-Column# -->\n<th scope=\"row\">New Export Orders</th>\n<td class=\"text-center\">49.4</td>
\n<td class=\"text-center\">52.6</td>\n<td class=\"text-center\">-3.2</td>\n<td class=\"text-center\">
Contracting</td>\n<td class=\"text-center\">From Growing</td>\n<td class=\"text-center\">1</td>\n
</tr>\n<tr>
<!-- Table#-Row#-Column# -->\n<th scope=\"row\">Imports</th>\n<td class=\"text-center\">52.5</td>\n<td
class=\"text-center\">54.4</td>\n<td class=\"text-center\">-1.9</td>\n<td class=\"text-center\">Growing
</td>\n<td class=\"text-center\">Slower</td>\n<td class=\"text-center\">3</td>\n
</tr>\n<tr>
<!-- Table#-Row#-Column# -->\n<th class=\"text-center\" colspan=\"4\" scope=\"row\">OVERALL ECONOMY</th>\n
<td class=\"text-center\">Growing</td>\n<td class=\"text-center\">Same</td>\n<td class=\"text-center\">27
</td>\n
</tr>\n<tr>
<!-- Table#-Row#-Column# -->\n<th class=\"text-center\" colspan=\"4\" scope=\"row\">Manufacturing Sector
</th>\n<td class=\"text-center\">Growing</td>\n<td class=\"text-center\">Same</td>\n<td
class=\"text-center\">27</td>\n
</tr>\n</tbody>\n</table>"]}
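The value under "table" is a list of HTML strings, so it still needs parsing before the numbers are usable. Here is a minimal sketch that assumes the response shape shown above and uses only the standard library. The payload is a shortened two-column version of that response, and TableRows is a helper name made up for this example.

```python
import json
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Shortened stand-in for response.text, same shape as the response above
payload = json.dumps({"table": [
    "<table><thead><tr><th>Index</th><th>Series Index Aug</th></tr></thead>"
    '<tbody><tr><th scope="row">Manufacturing PMI<sup>®</sup></th><td>52.8</td></tr>'
    '<tr><th scope="row">New Orders</th><td>51.3</td></tr></tbody></table>'
]})

table_html = json.loads(payload)["table"][0]
parser = TableRows()
parser.feed(table_html)
header, *body = parser.rows

print(header)  # ['Index', 'Series Index Aug']
for row in body:
    print(row)
```

With pandas installed, passing the same HTML string (wrapped in io.StringIO) to pd.read_html should give a DataFrame directly, as in the other answer.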