How to extract a table to csv using Python from website without having table id

Time:09-23

I am trying to make a csv on a daily basis from a specific website table: https://lunarcrush.com/exchanges

I've tried every piece of advice in the related topics here (e.g. How to extract tables from websites in Python, Python Extract Table from URL to csv, extract a html table data to csv file, and many more).

I thought my initial problem was that I didn't have the table id (as in the other examples); I only found the table's class name, MuiTable-root. But after a little more digging I found that the HTML I get when reading the URL is completely different from the one I see when I use Inspect Element in my browser.

I've tried almost everything I found here, so I am not sure it helps to quote every single piece of code. As an example I'll just quote the following attempt. The idea is simple: find the tr elements of the table, get the th (header) and td (data) cells, and then extract them to a csv.

from lxml import etree
import urllib.request

web = urllib.request.urlopen("https://lunarcrush.com/exchanges")
s = web.read()

html = etree.HTML(s)

## Get all 'tr' anywhere under the table (they sit inside thead/tbody,
## so use '//' rather than a direct child '/tr')
tr_nodes = html.xpath('//table[contains(@class, "MuiTable-root")]//tr')

## The 'th' cells are inside the first 'tr'
header = [th.text for th in tr_nodes[0].xpath('th')]

## Get the text from all remaining 'tr'
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]

print(td_content)
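(For what it's worth, the same approach does work when I run it against a static HTML snippet — a made-up two-column table below, reusing the MuiTable-root class name — so the rows really do seem to be missing from the HTML that urllib downloads:)

```python
from lxml import etree

# Made-up static snippet with the same class name as the live table
snippet = """
<table class="MuiTable-root">
  <tr><th>Exchange</th><th>Score</th></tr>
  <tr><td>Binance</td><td>9.5</td></tr>
</table>
"""
html = etree.HTML(snippet)

tr_nodes = html.xpath('//table[@class="MuiTable-root"]//tr')
header = [th.text for th in tr_nodes[0].xpath('th')]
rows = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]
print(header, rows)
```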

Any ideas? Sorry for the long (and maybe silly) question; I am just starting out with Python and there is still a lot to learn!

CodePudding user response:

Use selenium to render the page and pandas to collect the data into a dataframe and export it.

You can install them from the terminal by typing:

pip install pandas
pip install selenium
pip install webdriver-manager

More information about selenium can be found at https://selenium-python.readthedocs.io/ and about pandas at https://pandas.pydata.org/docs/

More info about the installation and drivers can be found at https://selenium-python.readthedocs.io/installation.html

Usually you would download the driver yourself and point selenium at it, as shown in the documentation. webdriver_manager does the same thing automatically:

driver = webdriver.Chrome(ChromeDriverManager().install())

In the code you need to import the packages you installed, i.e. pandas, selenium and webdriver_manager:

from selenium import webdriver 
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd

You instantiate your webdriver (a Chrome window controlled by Python) in the variable "driver". You then locate elements on the page with driver.find_elements_by_xpath and extract the text (or an attribute, when necessary) from each one:

xpath = '//tbody//tr'
row = [i.text for i in driver.find_elements_by_xpath(xpath)]

Finally you take the list of row texts, put it in a dictionary, load that into a pandas dataframe, and export it to your csv file:

dictionary = {'row': row}
df = pd.DataFrame(dictionary)
df.to_csv("lunarcrush.csv", index=False)

The whole thing should look like this:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://lunarcrush.com/exchanges")

xpath = '//tbody//tr'
row = [i.text for i in driver.find_elements_by_xpath(xpath)]
# In Selenium 4+ the method above is removed; use instead:
#   from selenium.webdriver.common.by import By
#   driver.find_elements(By.XPATH, xpath)

# Notice that .text is not always necessary; depending on the HTML element
# you may need .get_attribute(...) instead.
# You may also want to add the header and split each row into columns —
# look at .str.split() in pandas.
dictionary = {'row': row}
df = pd.DataFrame(dictionary)
df.to_csv("lunarcrush.csv", index=False)
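To illustrate the .str.split idea from the comments: each element of row is one long string per table row, so you can break it into columns afterwards. A minimal sketch with made-up row strings and hypothetical column names (the real lunarcrush rows will need their own separator and headers):

```python
import pandas as pd

# Made-up strings shaped like what element .text might return, one per row
row = ["Binance 9.5 1200", "Coinbase 9.1 800"]

df = pd.DataFrame({'row': row})
# expand=True turns the split pieces into separate dataframe columns
split_df = df['row'].str.split(' ', expand=True)
split_df.columns = ['exchange', 'score', 'volume']  # hypothetical header
split_df.to_csv("lunarcrush.csv", index=False)
print(split_df)
```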