I'm trying to scrape all the information from this website using Python/Selenium: https://bitinfocharts.com/top-100-richest-bitcoin-addresses.html
I can get the data successfully, but the problem is that the list has 100 rows and I only get the first 19 (the rows visible in the Chromium window when the page first loads).
I tried to scroll down the page like this:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
driver.find_element_by_tag_name('body').send_keys(Keys.END)
driver.find_element_by_tag_name('body').send_keys(Keys.PAGE_DOWN)
etc. The scrolling works, but nothing changes: I still get only 19 of the 100 rows. I also tried changing parameters like window size, headless, maximized, etc., and nothing changed.
from selenium import webdriver

chrome_driver_binary = "C:\\scraping\\selenium\\chromedriver.exe"
options = webdriver.ChromeOptions()
options.add_argument('--lang=en')
options.add_argument("--disable-extensions")
options.add_argument('--headless')
options.add_argument('--window-size=1920,1480')  # note: width,height are comma-separated
# drive Brave through chromedriver by pointing at the Brave binary
options.binary_location = "C:\\Program Files (x86)\\BraveSoftware\\Brave-Browser\\Application\\brave.exe"
driver = webdriver.Chrome(executable_path=chrome_driver_binary, options=options)
If I inspect the page manually in the Chrome window created by Selenium, I can see that all the elements are there, and I can watch the window scroll down to the bottom correctly.
So, where is the problem?
This is the main code that captures the data successfully (but only the first 19 rows). I'm including it just in case it's important.
# soup1 is a BeautifulSoup object built from driver.page_source
TABLE_RESULT_BTC_TOP100 = soup1.find('table', id="tblOne").find('tbody')
for tr_tag in TABLE_RESULT_BTC_TOP100.find_all('tr'):
    # ...extract the cells of each row here...
If not a solution, any clue would be appreciated :)
CodePudding user response:
The first 19 rows are in one table; the following ones are in another table. You have to grab both tables. Also, there's no need for Selenium here: the data is in the static HTML, so requests plus pandas is enough. Here's how to get all 100 rows.
import requests
import pandas as pd

url = "https://bitinfocharts.com/top-100-richest-bitcoin-addresses.html"
# a browser-like User-Agent (the site may block the default python-requests one)
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0",
}

# read_html returns a DataFrame for every <table> on the page;
# tables 2 and 3 are the two halves of the top-100 list
df = pd.read_html(requests.get(url, headers=headers).text, flavor="lxml")[2:4]
df = pd.concat(df)
print(df)
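A note on the [2:4] slice: it picks the third and fourth tables on the page by position, which is brittle if the layout ever changes. As an alternative (assuming the address column header is literally "Address"), read_html can select tables by content via its match parameter:

tables = pd.read_html(requests.get(url, headers=headers).text, flavor="lxml", match="Address")
df = pd.concat(tables)

Check len(tables) first, though, since match can also pick up other tables that contain the same word.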
Output:
0 1 ... Outs Unnamed: 0
0 NaN NaN ... 449.0 1.0
1 NaN NaN ... 78.0 2.0
2 NaN NaN ... 77.0 3.0
3 NaN NaN ... NaN 4.0
4 NaN NaN ... NaN 5.0
.. ... ... ... ... ...
76 96.0 bc1qxv55wuzz4qsfgss3uq2zwg5y88d7qv5hg67d2d ... NaN NaN
77 97.0 bc1qmjpguunz9lc7h6zf533wtjc70ync94ptnrjqmk ... NaN NaN
78 98.0 bc1qyr9dsfyst3epqycghpxshfmgy8qfzadfhp8suk ... NaN NaN
79 99.0 bc1q8qg2eazryu9as20k3hveuvz43thp200g7nw7qy ... NaN NaN
80 100.0 bc1q4ffskt6879l4ewffrrflpykvphshl9la4q037h ... NaN NaN
[100 rows x 20 columns]
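If you'd rather keep your original BeautifulSoup approach, the fix is the same idea: parse both tables instead of just tblOne. Here's a minimal sketch; the id of the second table is an assumption on my part (it appears to be "tblOne2", but verify it in devtools):

import requests
from bs4 import BeautifulSoup

url = "https://bitinfocharts.com/top-100-richest-bitcoin-addresses.html"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0"}
soup = BeautifulSoup(requests.get(url, headers=headers).text, "lxml")

rows = []
# "tblOne" holds the first 19 rows; "tblOne2" for the remaining rows is the assumed id
for table_id in ("tblOne", "tblOne2"):
    table = soup.find("table", id=table_id)
    if table is None:  # id guess wrong or layout changed
        continue
    body = table.find("tbody") or table
    for tr_tag in body.find_all("tr"):
        rows.append([td.get_text(strip=True) for td in tr_tag.find_all("td")])

print(len(rows))  # should be 100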