How to get the text from a BeautifulSoup HTML table scrape when .get_text() returns an error

I am trying to scrape a table from a website using Python and BeautifulSoup (I am a bit new to both). The code below is what I have tried so far; it creates a list of the td cells for each of the first two columns.

However, when I try to get the strings (2016-01, 2016-02, ...) from the td list Tcells1, or (1.4193, 1.3826, ...) from Tcells2, .get_text() gives me errors. I know pandas can scrape HTML tables, but I want to learn BeautifulSoup and I don't know what I am doing wrong here. I am using Python 3.8.8.

import requests 
from bs4 import BeautifulSoup as bs 

r = requests.get('https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/selected-crude-oil-prices-monthly-2016/17087') 

soup = bs(r.content,features='lxml') 

print(soup.prettify()) 

Trows = soup.find_all('tr') # this is all the table rows

Tcells1 = soup.find_all('td',attrs={"headers":"tbl6"})
Tcells2 = soup.find_all('td',attrs={"headers":"tbl7"})

CodePudding user response：

Another method to get the table data: get the text of every <td> cell and collect it into a nested list (one inner list per row):

import requests
from bs4 import BeautifulSoup


url = "https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/selected-crude-oil-prices-monthly-2016/17087"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

table = []
for row in soup.select("tr")[2:]:
    table.append([td.get_text() for td in row.select("td")])

column_names = [
    td.get_text(strip=True) for td in soup.select_one("tr").select("th")
]

print(column_names)
print(table)

Prints:

[
    "Month",
    "Exchange Rate",
    "Cdn Light SweetEdmonton",
    "Western Canada Select Hardisty",
    "Cdn Light Sweet Chicago",
    "WTIChicago",
    "Western Canada SelectChicago",
    "Brent Montreal",
]

[
    ["2016-01", "1.4193", "272", "123", "308", "300", "165", "358"],
    ["2016-02", "1.3826", "243", "193", "205", "276", "225", "324"],
    ["2016-03", "1.3232", "219", "154", "248", "302", "184", "325"],
    ["2016-04", "1.2819", "297", "219", "334", "345", "261", "371"],
    ["2016-05", "1.2946", "363", "282", "401", "397", "324", "410"],
    ["2016-06", "1.2892", "378", "293", "415", "411", "333", "425"],
    ["2016-07", "1.3064", "337", "251", "375", "383", "294", "401"],
    ["2016-08", "1,2996", "338", "248", "376", "381", "290", "397"],
    ["2016-09", "1.3109", "351", "259", "389", "388", "302", "402"],
    ["2016-10", "1.3245", "390", "299", "428", "432", "343", "441"],
    ["2016-11", "1.3432", "352", "258", "390", "402", "302", "412"],
    ["2016-12", "1.3342", "404", "306", "443", "454", "350", "474"],
    ["Average", "1.3245", "330", "241", "358", "373", "281", "393"],
]

Then you can easily construct a pandas DataFrame:

import pandas as pd

df = pd.DataFrame(table, columns=column_names)
print(df)

Prints:

      Month Exchange Rate Cdn Light SweetEdmonton Western Canada Select Hardisty Cdn Light Sweet Chicago WTIChicago Western Canada SelectChicago Brent Montreal
0   2016-01        1.4193                     272                            123                     308        300                          165            358
1   2016-02        1.3826                     243                            193                     205        276                          225            324
2   2016-03        1.3232                     219                            154                     248        302                          184            325
3   2016-04        1.2819                     297                            219                     334        345                          261            371
4   2016-05        1.2946                     363                            282                     401        397                          324            410
5   2016-06        1.2892                     378                            293                     415        411                          333            425
6   2016-07        1.3064                     337                            251                     375        383                          294            401
7   2016-08        1,2996                     338                            248                     376        381                          290            397
8   2016-09        1.3109                     351                            259                     389        388                          302            402
9   2016-10        1.3245                     390                            299                     428        432                          343            441
10  2016-11        1.3432                     352                            258                     390        402                          302            412
11  2016-12        1.3342                     404                            306                     443        454                          350            474
12  Average        1.3245                     330                            241                     358        373                          281            393
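
Note that every cell comes back as a string, and one exchange rate in the output above ('1,2996') uses a comma where the others use a period. If you need numbers, something like the following might work. It is only a sketch: the helper name to_float is made up for illustration, and treating the comma as a decimal point is my assumption about what the page meant.

```python
def to_float(text):
    """Convert a scraped cell to a float.

    Assumption: a comma in the cell is a mistyped decimal point,
    not a thousands separator.
    """
    return float(text.strip().replace(",", "."))


# Sample values copied from the Exchange Rate column printed above.
rates = ["1.4193", "1.3826", "1,2996"]
numeric_rates = [to_float(r) for r in rates]
print(numeric_rates)
```

If the comma could also be a thousands separator in other columns, this helper would give wrong results, so check the data first.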

CodePudding user response：

Your code seems to work. Keep in mind that Tcells1 and Tcells2 are lists: find_all() returns a ResultSet, which is a list of Tag objects, not a single Tag. This means you need to iterate through them and call get_text() on each element, e.g.:

for cell in Tcells1:
    print(cell.get_text())

Or, more succinctly, for Tcells2:

[cell.get_text() for cell in Tcells2]

Output:

['1.4193',
 '1.3826',
 '1.3232',
 '1.2819',
 '1.2946',
 '1.2892',
 '1.3064',
 '1,2996',
 '1.3109',
 '1.3245',
 '1.3432',
 '1.3342',
 '1.3245']
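
To see the difference without hitting the website, here is a small offline sketch. The HTML snippet is a made-up stand-in for the real table (only the headers="tbl6"/"tbl7" attributes are taken from the question), and I am assuming the error in the question came from calling .get_text() on the list itself.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's table, so the sketch runs offline.
html = """
<table>
  <tr><th id="tbl6">Month</th><th id="tbl7">Exchange Rate</th></tr>
  <tr><td headers="tbl6">2016-01</td><td headers="tbl7">1.4193</td></tr>
  <tr><td headers="tbl6">2016-02</td><td headers="tbl7">1.3826</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td", attrs={"headers": "tbl6"})

# find_all() returns a ResultSet (a list subclass); it has no get_text().
try:
    cells.get_text()
    error_raised = False
except AttributeError:
    error_raised = True

# Calling get_text() on each Tag inside the list works.
months = [cell.get_text(strip=True) for cell in cells]
print(error_raised, months)
```

So the fix is to call get_text() per element, not on the result of find_all().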

CodePudding user response：

You correctly find all td elements whose headers attribute is tbl6 or tbl7.

Tcells1 and Tcells2 are then list-type variables, such as:

[<td headers="tbl6">2016-01</td>, <td headers="tbl6">2016-02</td>, ...] # Tcells1
[<td headers="tbl7">1.4193</td>, <td headers="tbl7">1.3826</td>, ...] # Tcells2

You can therefore extract just the text values from them, as follows:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/selected-crude-oil-prices-monthly-2016/17087')

soup = bs(r.content,features='lxml')

Trows = soup.find_all('tr')  # this is all the table rows

Tcells1 = soup.find_all('td', attrs={"headers": "tbl6"})
Tcells2 = soup.find_all('td', attrs={"headers": "tbl7"})

Tcells1_values = [val.text for val in Tcells1[:-1]] # [:-1] to remove last value, "average" row
Tcells2_values = [val.text for val in Tcells2[:-1]] # [:-1] to remove last value, "average" row

print(Tcells1_values)
# ['2016-01', '2016-02', '2016-03', '2016-04', '2016-05', '2016-06', '2016-07', '2016-08', '2016-09', '2016-10', '2016-11', '2016-12']

print(Tcells2_values)
# ['1.4193', '1.3826', '1.3232', '1.2819', '1.2946', '1.2892', '1.3064', '1,2996', '1.3109', '1.3245', '1.3432', '1.3342']
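
If you then want each month paired with its exchange rate, zip() is one option. This is a rough offline sketch: the HTML is a made-up stand-in for the page, the "Average" value in it is a placeholder I invented, and rate_by_month is just an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's table; the Average value is a placeholder.
html = """
<table>
  <tr><td headers="tbl6">2016-01</td><td headers="tbl7">1.4193</td></tr>
  <tr><td headers="tbl6">2016-02</td><td headers="tbl7">1.3826</td></tr>
  <tr><td headers="tbl6">Average</td><td headers="tbl7">1.4010</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

month_cells = soup.find_all("td", attrs={"headers": "tbl6"})
rate_cells = soup.find_all("td", attrs={"headers": "tbl7"})

# [:-1] drops the trailing "Average" row, as in the code above.
months = [cell.get_text(strip=True) for cell in month_cells[:-1]]
rates = [cell.get_text(strip=True) for cell in rate_cells[:-1]]

# Pair the two columns positionally into a month -> rate mapping.
rate_by_month = dict(zip(months, rates))
print(rate_by_month)
```

This relies on both lists having the same length and order, which holds when every row has one cell for each column.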