Home > Software design >  How to scrap a table from a website without a specific tag?
How to scrap a table from a website without a specific tag?

Time:09-07

I'm developing a python script to scrape data from a specific site: enter image description here

I would like to scrap the above data and have tried this way

import requests
from bs4 import BeautifulSoup
res=requests.get("https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/residential-property-price-indexes-eight-capital-cities/latest-release")
soup=BeautifulSoup(res.text,"html.parser")
table=soup.find_all('table', class_='chart-data-table has-chart responsive-enabled double-headers')
print(table.get_text())

Once I tried to run it, error occurred:

Traceback (most recent call last):
  File "/Users/ryanngan/PycharmProjects/Webscraping/seek.py", line 6, in <module>
    print(table.get_text())
  File "/Users/ryanngan/PycharmProjects/Webscraping/venv/lib/python3.9/site-packages/bs4/element.py", line 2289, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'get_text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

How can I scrap the above data in the website? Am I referring to a wrong tag? Thank you very much.

CodePudding user response:

The easiest way is to use pandas.read_html:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.abs.gov.au/statistics/economy/price-indexes-and-inflation/residential-property-price-indexes-eight-capital-cities/latest-release#residential-property-price-indexes"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
table = soup.select_one(
    'table:has(caption:-soup-contains("Residential Property Price Indexes, capital cities"))'
)

# parse the table manually or use pandas.read_html
df = pd.read_html(str(table))[0]
print(df)

Prints:

   Unnamed: 0  Sydney  Melbourne  Brisbane  Adelaide  Perth  Hobart  Darwin  Canberra  Weighted average of eight capital cities
0      Dec-11    98.4      100.0     100.2     100.7   99.4   101.9    98.2     100.9                                      99.4
1      Mar-12   100.3       99.4     100.0      99.3  100.5    99.4   100.8     100.8                                     100.0
2      Jun-12   101.4       99.3      99.9      99.6  101.0    98.2   104.1      99.5                                     100.4
3      Sep-12   100.9       98.6     100.8      99.2  102.1    98.1   105.5      99.5                                     100.2
4      Dec-12   103.7      100.4     101.7     100.2  105.2    98.4   107.8     101.8                                     102.4
5      Mar-13   104.7      100.8     101.9      99.8  107.5   100.0   109.6     100.3                                     103.1
6      Jun-13   108.7      102.7     103.2     100.9  110.6   100.0   111.0     101.0                                     105.7

...

Manual parsing:

header = [th.text for th in table.thead.select("th")]
print(*header, sep="\t")
for row in table.tbody.select("tr"):
    tds = [td.text for td in row.select("th, td")]
    print(*tds, sep="\t")

CodePudding user response:

find_all() returns an array of elements. You should go through all of them and select that one you are need. And than call get_text()

for el in table:
   print el.get_text()
  • Related