I'm trying to scrape a 221x7 table with Selenium. Since my first approach takes approx. 3 sec, I was wondering what the fastest way is, and what best practice would be at the same time.
1st: 3.6sec
table_content = driver_lsx_watchlist.find_elements(By.XPATH, '''//*[@id="page_content"]/div/div/div/div/module/div/table/tbody''')
table_content = table_content[0].text
table_content = table_content.splitlines()
for i in range(len(table_content)):
    print(f'{i} {table_content[i]}')
2nd: about 200sec!!!
for row in range(1, 222):
    row_text = ''
    for column in range(1, 7):
        xpath = '''//*[@id="page_content"]/div/div/div/div/module/div/table/tbody/tr[''' + str(row) + ''']/td[''' + str(column) + ''']/div'''
        row_text = row_text + driver_lsx_watchlist.find_elements(By.XPATH, xpath)[0].text
    print(row_text)
3rd: a bit over 4sec
print(driver_lsx_watchlist.find_element(By.XPATH, "/html/body").text)
4th: 0.2sec
ActionChains(driver_lsx_watchlist)\
    .key_down(Keys.CONTROL)\
    .send_keys("a")\
    .key_up(Keys.CONTROL)\
    .key_down(Keys.CONTROL)\
    .send_keys("c")\
    .key_up(Keys.CONTROL)\
    .perform()
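To get the copied text back into Python I then have to read the clipboard, e.g. with the third-party pyperclip module (sketch):
import pyperclip

# read back whatever the Ctrl+C above put on the system clipboard
table_content = pyperclip.paste()
print(table_content.splitlines()[:5])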
The clipboard approach seems to be the fastest of all, but it renders my PC useless while it runs, because the process occupies the clipboard. So I wonder what best practice would be, and whether I can get a proper solution under 1 second while still using the very same PC.
CodePudding user response:
To scrape the table within the webpage, you need to induce WebDriverWait for visibility_of_element_located() on the <table> element; then you can hand its HTML to pandas' read_html() using the following locator strategy:
driver.get('https://www.ls-x.de/de/watchlist')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.btn.btn-primary.accept"))).click()
data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//*[@id='page_content']/div/div/div/div/module/div/table"))).get_attribute("outerHTML")
df = pd.read_html(data)[0]  # read_html() returns a list of DataFrames
print(df)
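Since the site is German, the figures in the table are presumably formatted with comma decimals; read_html() can normalize that directly (an assumption about the page's number formatting):
# decimal/thousands handle German-style figures like 1.234,56 (assumed format)
df = pd.read_html(data, decimal=',', thousands='.')[0]
print(df.head())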
Note: You have to add the following imports:
import pandas as pd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
CodePudding user response:
You can try pandas:
import pandas as pd
# ... rest of your selenium imports
from selenium.webdriver.common.by import By

# retrieve the page, then locate the full table (including thead etc.)
# using the XPath from your question; outerHTML keeps the <table> tag,
# which read_html() needs to find the table
table = driver_lsx_watchlist.find_element(By.XPATH, '//*[@id="page_content"]/div/div/div/div/module/div/table')
dfs = pd.read_html(table.get_attribute('outerHTML'))
print(dfs[0])
This should return a dataframe, probably more usable than clipboard etc.
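If the goal is staying under one second, you can time it yourself; grabbing outerHTML is a single WebDriver round trip, so it should be in the same ballpark as the clipboard trick (a sketch reusing the table element located above, timings obviously machine-dependent):
import time

start = time.perf_counter()
html = table.get_attribute('outerHTML')  # one round trip to the browser
df = pd.read_html(html)[0]
print(f'{len(df)} rows in {time.perf_counter() - start:.2f}s')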
Of course, depending on the information you are after, there are other ways to retrieve the data from that website (including websockets and API/XHR calls).
Data loads dynamically on that page from a few API endpoints, which you can see in the Network tab of Dev Tools. You could, for example, retrieve data from this endpoint with requests and transform it into a dataframe, depending on what data you're after:
import requests
import pandas as pd
r = requests.get('https://www.ls-x.de/_rpc/json/instrument/chart/dataForInstrument?container=chart9&instrumentId=70577&marketId=2&quotetype=mid&series=intraday&type=mini&localeId=2')
# print(r.json())
df = pd.DataFrame(r.json()['series']['intraday']['data'])
print(df)
This would return a dataframe like:
0 1
0 1658991600000 102.515
1 1658991660000 102.515
2 1658991720000 102.555
3 1658991780000 102.545
4 1658991840000 102.525
... ... ...
924 1659047700000 102.350
925 1659047760000 102.335
926 1659047820000 102.315
927 1659047880000 102.350
928 1659047940000 102.340
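The first column is an epoch timestamp in milliseconds; a quick follow-up makes it readable (the column names here are my guess):
# label the columns (names assumed) and convert the millisecond timestamps
df.columns = ['timestamp', 'price']
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
print(df.head())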
There are several API endpoints being accessed to bring information into that page (all GET, nothing fancy), each returning detailed JSON data. You could have a look at each of them and see if you can get your information from there.