I am trying to scrape a table from a website using Python and BeautifulSoup (I am fairly new to both). The code below is what I have tried so far; it builds a list of the cells in each of the first two columns.
However, when I try to get the strings (2016-01, 2016-02, ...) from the td list Tcells1, or (1.4193, 1.3826, ...) from Tcells2, .get_text() gives me errors. I know pandas can scrape HTML tables, but I want to learn BeautifulSoup and I don't know what I am doing wrong here. I am using Python 3.8.8.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/selected-crude-oil-prices-monthly-2016/17087')
soup = bs(r.content,features='lxml')
print(soup.prettify())
Trows = soup.find_all('tr') # this is all the table rows
Tcells1 = soup.find_all('td',attrs={"headers":"tbl6"})
Tcells2 = soup.find_all('td',attrs={"headers":"tbl7"})
CodePudding user response:
Another way to get the table data: take the text of every <td> cell and collect it into a nested list, one inner list per row:
import requests
from bs4 import BeautifulSoup
url = "https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/selected-crude-oil-prices-monthly-2016/17087"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
table = []
for row in soup.select("tr")[2:]:  # skip the first two <tr> elements
    table.append([td.get_text() for td in row.select("td")])
# column names come from the <th> cells of the first <tr>
column_names = [
    td.get_text(strip=True) for td in soup.select_one("tr").select("th")
]
print(column_names)
print(table)
Prints:
[
"Month",
"Exchange Rate",
"Cdn Light SweetEdmonton",
"Western Canada Select Hardisty",
"Cdn Light Sweet Chicago",
"WTIChicago",
"Western Canada SelectChicago",
"Brent Montreal",
]
[
["2016-01", "1.4193", "272", "123", "308", "300", "165", "358"],
["2016-02", "1.3826", "243", "193", "205", "276", "225", "324"],
["2016-03", "1.3232", "219", "154", "248", "302", "184", "325"],
["2016-04", "1.2819", "297", "219", "334", "345", "261", "371"],
["2016-05", "1.2946", "363", "282", "401", "397", "324", "410"],
["2016-06", "1.2892", "378", "293", "415", "411", "333", "425"],
["2016-07", "1.3064", "337", "251", "375", "383", "294", "401"],
["2016-08", "1,2996", "338", "248", "376", "381", "290", "397"],
["2016-09", "1.3109", "351", "259", "389", "388", "302", "402"],
["2016-10", "1.3245", "390", "299", "428", "432", "343", "441"],
["2016-11", "1.3432", "352", "258", "390", "402", "302", "412"],
["2016-12", "1.3342", "404", "306", "443", "454", "350", "474"],
["Average", "1.3245", "330", "241", "358", "373", "281", "393"],
]
Then you can easily construct a pandas DataFrame:
import pandas as pd
df = pd.DataFrame(table, columns=column_names)
print(df)
Prints:
Month Exchange Rate Cdn Light SweetEdmonton Western Canada Select Hardisty Cdn Light Sweet Chicago WTIChicago Western Canada SelectChicago Brent Montreal
0 2016-01 1.4193 272 123 308 300 165 358
1 2016-02 1.3826 243 193 205 276 225 324
2 2016-03 1.3232 219 154 248 302 184 325
3 2016-04 1.2819 297 219 334 345 261 371
4 2016-05 1.2946 363 282 401 397 324 410
5 2016-06 1.2892 378 293 415 411 333 425
6 2016-07 1.3064 337 251 375 383 294 401
7 2016-08 1,2996 338 248 376 381 290 397
8 2016-09 1.3109 351 259 389 388 302 402
9 2016-10 1.3245 390 299 428 432 343 441
10 2016-11 1.3432 352 258 390 402 302 412
11 2016-12 1.3342 404 306 443 454 350 474
12 Average 1.3245 330 241 358 373 281 393
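Note that every cell comes back as a string, and one exchange rate in the output above is written with a comma (1,2996). If you need numbers, a small conversion step helps. This is only a sketch: to_float is a hypothetical helper name, and it assumes a comma in these cells is a decimal separator rather than a thousands separator.

```python
def to_float(text):
    """Convert a scraped cell such as '1.4193' or '1,2996' to a float."""
    # Assumption: a comma here is a decimal separator, not a thousands separator.
    return float(text.strip().replace(",", "."))


# A few exchange-rate strings copied from the output above.
raw_rates = ["1.4193", "1.3826", "1,2996"]
rates = [to_float(value) for value in raw_rates]
print(rates)
```

The same helper could be applied to a DataFrame column with df["Exchange Rate"].map(to_float) if you go the pandas route.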
CodePudding user response:
Your code seems to work. Keep in mind that your objects (Tcells1, Tcells2, etc.) are lists of tags, not single tags. This means you need to iterate through them and call the get_text() method on each object inside them, e.g.:
for cell in Tcells1:
    print(cell.get_text())
Output:
2016-01
2016-02
2016-03
2016-04
2016-05
2016-06
2016-07
2016-08
2016-09
2016-10
2016-11
2016-12
Average
Or, more succinctly, for Tcells2:
[cell.get_text() for cell in Tcells2]
Output:
['1.4193',
'1.3826',
'1.3232',
'1.2819',
'1.2946',
'1.2892',
'1.3064',
'1,2996',
'1.3109',
'1.3245',
'1.3432',
'1.3342',
'1.3245']
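To see the original error without hitting the network, here is a minimal sketch that reproduces it on an inline snippet. The two-row HTML string is made up for illustration and only mimics the headers attributes used on the real page; the variable names are likewise hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table that mimics the structure of the real page.
html = """
<table>
  <tr><td headers="tbl6">2016-01</td><td headers="tbl7">1.4193</td></tr>
  <tr><td headers="tbl6">2016-02</td><td headers="tbl7">1.3826</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td", attrs={"headers": "tbl6"})

error_message = None
try:
    # find_all() returns a list-like result set, which has no get_text()
    cells.get_text()
except AttributeError as exc:
    error_message = str(exc)
    print("AttributeError:", error_message)

# Calling get_text() on each tag inside the result set works.
months = [cell.get_text(strip=True) for cell in cells]
print(months)
```

The first call fails because get_text() is a method of individual tags; the list comprehension succeeds because it calls the method on each tag in turn.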
CodePudding user response:
You correctly find all td elements whose headers attribute is tbl6 or tbl7.
Tcells1 and Tcells2 are then list-like variables, such as:
[<td headers="tbl6">2016-01</td>, <td headers="tbl6">2016-02</td>, ...] # Tcells1
[<td headers="tbl7">1.4193</td>, <td headers="tbl7">1.3826</td>, ...] # Tcells2
Thus, you can get only text values from them, as follows:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/selected-crude-oil-prices-monthly-2016/17087')
soup = bs(r.content,features='lxml')
Trows = soup.find_all('tr') # this is all the table rows
Tcells1 = soup.find_all('td', attrs={"headers": "tbl6"})
Tcells2 = soup.find_all('td', attrs={"headers": "tbl7"})
Tcells1_values = [val.text for val in Tcells1[:-1]] # [:-1] to remove last value, "average" row
Tcells2_values = [val.text for val in Tcells2[:-1]] # [:-1] to remove last value, "average" row
print(Tcells1_values)
# ['2016-01', '2016-02', '2016-03', '2016-04', '2016-05', '2016-06', '2016-07', '2016-08', '2016-09', '2016-10', '2016-11', '2016-12']
print(Tcells2_values)
# ['1.4193', '1.3826', '1.3232', '1.2819', '1.2946', '1.2892', '1.3064', '1,2996', '1.3109', '1.3245', '1.3432', '1.3342']
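Once both lists of text values exist, they can be paired up by position. The following is a small sketch using the first three values printed above; it assumes both lists are in the same row order and have the same length, and rate_by_month is a name chosen only for illustration.

```python
# First three values from each list, copied from the output above.
months = ["2016-01", "2016-02", "2016-03"]
rates = ["1.4193", "1.3826", "1.3232"]

# zip() pairs items by position and stops at the end of the shorter list.
rate_by_month = dict(zip(months, rates))
print(rate_by_month["2016-02"])
```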