First post - be gentle!
I am starting to learn Python and would like to get information from a table in a web page (https://en.wikipedia.org/wiki/European_Union#Demographics) into a pandas DataFrame.
I am using Google Colab, and from a bit of research I understand the process involves 'web scraping', i.e. turning HTML into CSV.
Any thoughts welcome please. Worth noting I am constrained by not being able to download additional software due to the secure nature of my work.
Thanks.
CodePudding user response:
You need a library to help you parse the HTML - a well-known Python library for that is BeautifulSoup.
There are also online tools that do this kind of thing for you, and you can take some inspiration from them even if you can't use them directly: https://wikitable2csv.ggor.de/ As you can see, that site uses the CSS selector "table.wikitable" to identify the tables.
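To make that concrete, here is a minimal sketch of the "table.wikitable" idea with BeautifulSoup. It assumes the bs4 package is available in your environment. The SAMPLE_HTML string and the wikitable_rows helper are made up for illustration; in practice you would pass in the HTML of the page you downloaded, and real Wikipedia tables may need extra handling (merged cells, footnote markers).

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the page source; the values are invented for illustration.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>Exampleland</td><td>1,000</td></tr>
  <tr><td>Sampleia</td><td>2,500</td></tr>
</table>
"""


def wikitable_rows(html, index=0):
    """Return (header, rows) for the index-th table matching table.wikitable.

    Hypothetical helper: it treats the first non-empty row as the header.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select("table.wikitable")[index]
    parsed = []
    for tr in table.find_all("tr"):
        # Header cells are <th>, data cells are <td>; keep their visible text.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            parsed.append(cells)
    return parsed[0], parsed[1:]


header, rows = wikitable_rows(SAMPLE_HTML)
print(header)
print(rows)
```

From there, pandas.DataFrame(rows, columns=header) should give you the DataFrame, and its to_csv method writes it out as CSV.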
CodePudding user response:
You can use Scrapy, a Python-based scraping framework, to fetch and parse the data. In Scrapy, you create spiders that crawl a set of URLs you specify. You can then parse the HTML using something like Beautiful Soup to get your table from the response. The Scrapy documentation is pretty useful and should get you set up quickly! Scrapy also lets you export the parsed data as CSV, which should help with the export part.
All the best!