How to get the text from a BeautifulSoup HTML table scrape: .get_text() is returning an error

Time:02-22

I am trying to scrape a table from a website using Python and BeautifulSoup (I am fairly new to both). The code below is what I have tried so far; it builds a list of the cells in each of the first two columns.

However, when I try to get the strings (2016-01, 2016-02, ...) from the td list Tcells1, or (1.4193, 1.3826, ...) from Tcells2, .get_text() raises an error. I know pandas can scrape HTML tables, but I want to learn BeautifulSoup and I don't know what I am doing wrong here. I am using Python 3.8.8.

import requests 
from bs4 import BeautifulSoup as bs 

r = requests.get('https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/selected-crude-oil-prices-monthly-2016/17087') 

soup = bs(r.content,features='lxml') 

print(soup.prettify()) 

Trows = soup.find_all('tr') # this is all the table rows

Tcells1 = soup.find_all('td',attrs={"headers":"tbl6"})
Tcells2 = soup.find_all('td',attrs={"headers":"tbl7"})

CodePudding user response:

Another way to get the table data: take the text of every <td> cell, row by row, and collect it into a nested list:

import requests
from bs4 import BeautifulSoup


url = "https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/selected-crude-oil-prices-monthly-2016/17087"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

table = []
for row in soup.select("tr")[2:]:
    table.append([td.get_text() for td in row.select("td")])

column_names = [
    td.get_text(strip=True) for td in soup.select_one("tr").select("th")
]

print(column_names)
print(table)

Prints:

[
    "Month",
    "Exchange Rate",
    "Cdn Light SweetEdmonton",
    "Western Canada Select Hardisty",
    "Cdn Light Sweet Chicago",
    "WTIChicago",
    "Western Canada SelectChicago",
    "Brent Montreal",
]

[
    ["2016-01", "1.4193", "272", "123", "308", "300", "165", "358"],
    ["2016-02", "1.3826", "243", "193", "205", "276", "225", "324"],
    ["2016-03", "1.3232", "219", "154", "248", "302", "184", "325"],
    ["2016-04", "1.2819", "297", "219", "334", "345", "261", "371"],
    ["2016-05", "1.2946", "363", "282", "401", "397", "324", "410"],
    ["2016-06", "1.2892", "378", "293", "415", "411", "333", "425"],
    ["2016-07", "1.3064", "337", "251", "375", "383", "294", "401"],
    ["2016-08", "1,2996", "338", "248", "376", "381", "290", "397"],
    ["2016-09", "1.3109", "351", "259", "389", "388", "302", "402"],
    ["2016-10", "1.3245", "390", "299", "428", "432", "343", "441"],
    ["2016-11", "1.3432", "352", "258", "390", "402", "302", "412"],
    ["2016-12", "1.3342", "404", "306", "443", "454", "350", "474"],
    ["Average", "1.3245", "330", "241", "358", "373", "281", "393"],
]

Then you can easily construct a pandas DataFrame:

import pandas as pd

df = pd.DataFrame(table, columns=column_names)
print(df)

Prints:

      Month Exchange Rate Cdn Light SweetEdmonton Western Canada Select Hardisty Cdn Light Sweet Chicago WTIChicago Western Canada SelectChicago Brent Montreal
0   2016-01        1.4193                     272                            123                     308        300                          165            358
1   2016-02        1.3826                     243                            193                     205        276                          225            324
2   2016-03        1.3232                     219                            154                     248        302                          184            325
3   2016-04        1.2819                     297                            219                     334        345                          261            371
4   2016-05        1.2946                     363                            282                     401        397                          324            410
5   2016-06        1.2892                     378                            293                     415        411                          333            425
6   2016-07        1.3064                     337                            251                     375        383                          294            401
7   2016-08        1,2996                     338                            248                     376        381                          290            397
8   2016-09        1.3109                     351                            259                     389        388                          302            402
9   2016-10        1.3245                     390                            299                     428        432                          343            441
10  2016-11        1.3432                     352                            258                     390        402                          302            412
11  2016-12        1.3342                     404                            306                     443        454                          350            474
12  Average        1.3245                     330                            241                     358        373                          281            393
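
Note that every value in this DataFrame is still a string, because get_text() returns text. A minimal sketch of converting the columns to numbers follows, using two rows copied from the output above rather than the live page. The choice of pd.to_numeric with errors="coerce" is my own suggestion, not part of the original answer; it turns any cell that cannot be parsed into NaN instead of raising.

```python
import pandas as pd

# Two sample rows copied from the output above; the second keeps the source's "1,2996".
df = pd.DataFrame(
    [["2016-07", "1.3064", "337"], ["2016-08", "1,2996", "338"]],
    columns=["Month", "Exchange Rate", "Cdn Light SweetEdmonton"],
)

# Convert every column except "Month"; unparseable cells become NaN.
for col in df.columns[1:]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)
print(df)
```

The "1,2996" cell comes out as NaN, which makes that entry in the scraped data easy to spot before doing any arithmetic on the column.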

CodePudding user response:

Your code works as far as it goes. Keep in mind that Tcells1 and Tcells2 are lists (find_all() returns a ResultSet, which is a list of Tag objects), so get_text() is not available on them directly. You need to iterate through them and call get_text() on each element, e.g.:

for cell in Tcells1:
    print(cell.get_text())

Output:

2016-01
2016-02
2016-03
2016-04
2016-05
2016-06
2016-07
2016-08
2016-09
2016-10
2016-11
2016-12
Average

Or more succinctly for Tcells2:

[cell.get_text() for cell in Tcells2]

Output:

['1.4193',
 '1.3826',
 '1.3232',
 '1.2819',
 '1.2946',
 '1.2892',
 '1.3064',
 '1,2996',
 '1.3109',
 '1.3245',
 '1.3432',
 '1.3342',
 '1.3245']
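
To see the difference without hitting the website, here is a small offline sketch. The inline HTML is a made-up stand-in for the real page (only the headers attribute values are taken from the question), and it uses the built-in "html.parser" so lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the real page, so the sketch runs offline.
html = """
<table>
  <tr><td headers="tbl6">2016-01</td><td headers="tbl7">1.4193</td></tr>
  <tr><td headers="tbl6">2016-02</td><td headers="tbl7">1.3826</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td", attrs={"headers": "tbl6"})

# find_all() returns a ResultSet (a list subclass), which has no get_text().
try:
    cells.get_text()
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

# Calling get_text() on each Tag inside the list works.
texts = [cell.get_text(strip=True) for cell in cells]

print(error_name)
print(texts)
```

The first call fails because the method belongs to the individual Tag objects, not to the list that holds them; the list comprehension succeeds because it calls get_text() on one Tag at a time.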

CodePudding user response:

You correctly find all the td elements whose headers attribute is tbl6 or tbl7.

Tcells1 and Tcells2 are then list-type variables, such as:

[<td headers="tbl6">2016-01</td>, <td headers="tbl6">2016-02</td>, ...] # Tcells1
[<td headers="tbl7">1.4193</td>, <td headers="tbl7">1.3826</td>, ...] # Tcells2

You can therefore extract just the text values from them as follows:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/selected-crude-oil-prices-monthly-2016/17087')

soup = bs(r.content,features='lxml')

Trows = soup.find_all('tr')  # this is all the table rows

Tcells1 = soup.find_all('td', attrs={"headers": "tbl6"})
Tcells2 = soup.find_all('td', attrs={"headers": "tbl7"})

Tcells1_values = [val.text for val in Tcells1[:-1]] # [:-1] to remove last value, "average" row
Tcells2_values = [val.text for val in Tcells2[:-1]] # [:-1] to remove last value, "average" row

print(Tcells1_values)
# ['2016-01', '2016-02', '2016-03', '2016-04', '2016-05', '2016-06', '2016-07', '2016-08', '2016-09', '2016-10', '2016-11', '2016-12']

print(Tcells2_values)
# ['1.4193', '1.3826', '1.3232', '1.2819', '1.2946', '1.2892', '1.3064', '1,2996', '1.3109', '1.3245', '1.3432', '1.3342']
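
Since the two lists are aligned row by row, they can be paired up with zip(). The sketch below uses two values copied from the printed lists instead of the live page; the name rate_by_month is just illustrative, and replacing the comma with a dot assumes that '1,2996' is a mistyped decimal point in the source table.

```python
# Sample values copied from the printed lists above.
Tcells1_values = ["2016-07", "2016-08"]
Tcells2_values = ["1.3064", "1,2996"]

# zip() pairs the n-th month with the n-th rate.
# Assumption: a comma in a rate is a mistyped decimal point, so swap it before float().
rate_by_month = {
    month: float(rate.replace(",", "."))
    for month, rate in zip(Tcells1_values, Tcells2_values)
}

print(rate_by_month)
```

Without the replace() call, float('1,2996') would raise a ValueError.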