Trying to scrape data from website using Pandas

Time:10-18

I have been trying to scrape table information from this website - https://ethernodes.org/

I watched several tutorials and got stuck at this point:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/Users/a130/Downloads/chromedriver')
driver.get('https://ethernodes.org/')

This code raises the following error:

WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://chromedriver.chromium.org/home

It was not clear to me how to solve this issue, so I am asking for your help. I am using macOS.

CodePudding user response:

import pandas as pd

url = 'https://ethernodes.org'
# read_html returns a list of DataFrames, one per <table> element found
dfs = pd.read_html(url)

CodePudding user response:

Chromedriver needs to be in your PATH. See the reply from user 'import random' to your post, or look at this question: selenium - chromedriver executable needs to be in PATH
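One way to confirm the diagnosis is to check whether chromedriver is actually visible on PATH before starting the browser. The following is a minimal sketch, not a definitive fix: the helper names `find_driver` and `start_chrome` and the `extra_dir` parameter are made up for illustration, and `start_chrome` assumes Selenium 4, where the driver location is passed through a `Service` object.

```python
import os
import shutil


def find_driver(name="chromedriver", extra_dir=None):
    """Return the full path to `name` if it can be found, else None.

    `extra_dir` is a hypothetical convenience: a directory to search
    before the regular PATH entries (for example, a Downloads folder).
    """
    search_path = os.environ.get("PATH", "")
    if extra_dir:
        # Search the extra directory first without modifying the real PATH.
        search_path = extra_dir + os.pathsep + search_path
    return shutil.which(name, path=search_path)


def start_chrome(driver_path):
    """Start Chrome with an explicit driver path (assumes Selenium 4)."""
    # Imported lazily so the PATH check above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    return webdriver.Chrome(service=Service(executable_path=driver_path))
```

If `find_driver()` returns None, the driver is not on PATH; calling `find_driver(extra_dir='/Users/a130/Downloads')` would then tell you whether the file from the question exists and is executable, and the returned path could be handed to `start_chrome`.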

As for scraping the tables from that website, I don't think you can use pd.read_html, because the tables aren't stored inside table tags; they're stored inside ul and li tags. Here's some rough code that seems to work for your use case. Note that you have to change the URL in browser.get(url) to extract each of the individual tables, or you could make it a command-line argument:

from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

browser = webdriver.Chrome()
browser.get('https://ethernodes.org/sync')
#Get all the list tags, which is where the table data is stored for this website
lst = browser.find_elements(By.TAG_NAME, 'li')
data = [[]]
df = pd.DataFrame()
#Don't include the navigation links so start the for loop at index 9
for i in range(9, len(lst)):
    tmp = lst[i].get_attribute("innerText")
    #There are more tags within each list element, splitting on \n gives us the data with column name at index 0 and column value at index 1
    tmp2 = tmp.split("\n")
    #Add column names to a single list
    data[0].append(tmp2[0])
    #Add column values to separate lists consisting of only one value
    data.append([tmp2[1]])
for j in range(1, len(data)):
    #Assign dataframe column to its associated value list
    df[data[0][j-1]] = data[j]
print(df)
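
The loop above assumes every list item holds exactly a name and a value separated by a newline, and the URL is hard-coded. Below is a slightly more defensive sketch of the same parsing step, written as a pure function so it can be tried without a browser, together with the command-line argument idea mentioned earlier. The names `items_to_row` and `parse_args` and the `skip` parameter are illustrative assumptions, not part of the original answer.

```python
import argparse


def items_to_row(texts, skip=0):
    """Turn 'name\\nvalue' strings into a {name: value} dict.

    `texts` is a list of innerText strings, one per <li> element.
    `skip` drops leading items such as navigation links.
    Items without both a name and a value are ignored.
    """
    row = {}
    for text in texts[skip:]:
        # Split on newlines and drop empty or whitespace-only pieces.
        parts = [p.strip() for p in text.split("\n") if p.strip()]
        if len(parts) < 2:
            continue
        row[parts[0]] = parts[1]
    return row


def parse_args(argv=None):
    """Read the page URL from the command line, with a default."""
    parser = argparse.ArgumentParser(description="Scrape one ethernodes.org page")
    parser.add_argument("url", nargs="?", default="https://ethernodes.org/sync")
    return parser.parse_args(argv)
```

With the Selenium code above, this could be used as `texts = [el.get_attribute("innerText") for el in lst]` followed by `df = pd.DataFrame([items_to_row(texts, skip=9)])`, and `browser.get(parse_args().url)` would take the page from the command line. The `skip=9` value comes from the answer's own loop and may need adjusting if the page layout changes.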