"Benchmark" of different methods to get text extracted


I'm trying to scrape the text of a 221x7 table with Selenium. Since my first approach takes approx. 3 sec, I was wondering what the fastest way is and what best practice would be at the same time.
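
For reference, a minimal sketch of how such timings can be taken with time.perf_counter (scrape_table is a hypothetical stand-in for any of the approaches below):

import time

start = time.perf_counter()
scrape_table()  # hypothetical placeholder for one of the approaches below
print(f'elapsed: {time.perf_counter() - start:.3f} sec')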

1st: 3.6sec

# grab the whole <tbody> once and split its rendered text into lines
table_content = driver_lsx_watchlist.find_elements(By.XPATH, '''//*[@id="page_content"]/div/div/div/div/module/div/table/tbody''')
table_content = table_content[0].text
table_content = table_content.splitlines()
for i, line in enumerate(table_content):
    print(f'{i} {line}')

2nd: about 200sec!!!

# one find_elements() round trip per cell: 221 rows x 7 columns
for row in range(1, 222):
    row_text = ''
    for column in range(1, 8):  # columns 1-7; range(1, 7) would miss the last one
        xpath = '''//*[@id="page_content"]/div/div/div/div/module/div/table/tbody/tr[''' + str(row) + ''']/td[''' + str(column) + ''']/div'''
        row_text = row_text + driver_lsx_watchlist.find_elements(By.XPATH, xpath)[0].text
    print(row_text)

3rd: a bit over 4sec

print(driver_lsx_watchlist.find_element(By.XPATH, "/html/body").text)

4th: 0.2sec

# select the whole page and copy it to the clipboard
ActionChains(driver_lsx_watchlist)\
    .key_down(Keys.CONTROL)\
    .send_keys("a")\
    .key_up(Keys.CONTROL)\
    .key_down(Keys.CONTROL)\
    .send_keys("c")\
    .key_up(Keys.CONTROL)\
    .perform()
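
Reading the copied text back in would look something like this (a minimal sketch, assuming the third-party pyperclip package is installed):

import pyperclip

# after the Ctrl+A / Ctrl+C chain above, the page text sits in the
# system clipboard and can be read back as one string
table_text = pyperclip.paste()
rows = table_text.splitlines()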

The clipboard approach is the fastest of all, but it renders my PC useless since the clipboard itself is occupied by the process. So I wonder what the best practice would be, and whether I can get a proper solution in under 1 second while still using the very same PC.

CodePudding user response:

To scrape the table within the webpage you need to induce WebDriverWait for the visibility_of_element_located() of the <table> element; then you can hand its HTML to pandas' read_html() to build a DataFrame. You can use the following locator strategy:

driver.execute("get", {'url': 'https://www.ls-x.de/de/watchlist'})
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.btn.btn-primary.accept"))).click()
data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//*[@id='page_content']/div/div/div/div/module/div/table"))).get_attribute("outerHTML")
df  = pd.read_html(data)
print(df)

Note: You have to add the following imports:

import pandas as pd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

CodePudding user response:

You can try pandas:

import pandas as pd
[... rest of your selenium imports]

[... retrieve the page, locate the element]

dfs = pd.read_html([table located with full html, including thead etc].get_attribute('innerHTML'))
print(dfs[0])

This should return a dataframe, which is probably more usable than clipboard contents.

Of course, depending on the information you are after, there are other ways to retrieve the data from that website (including websockets and API/XHR calls).

The data loads dynamically on that page from a few API endpoints, which can be seen in the Network tab of the browser's dev tools. You could, for example, retrieve data from this endpoint with requests and transform it into a dataframe, depending on what data you're after:

import requests
import pandas as pd

# intraday chart data for a single instrument, returned as JSON
r = requests.get('https://www.ls-x.de/_rpc/json/instrument/chart/dataForInstrument?container=chart9&instrumentId=70577&marketId=2&quotetype=mid&series=intraday&type=mini&localeId=2')
# print(r.json())
df = pd.DataFrame(r.json()['series']['intraday']['data'])
print(df)

This would return a dataframe like:

0   1
0   1658991600000   102.515
1   1658991660000   102.515
2   1658991720000   102.555
3   1658991780000   102.545
4   1658991840000   102.525
... ... ...
924 1659047700000   102.350
925 1659047760000   102.335
926 1659047820000   102.315
927 1659047880000   102.350
928 1659047940000   102.340
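
Judging by the URL parameters and the values above, the two columns look like an epoch-millisecond timestamp and the mid quote; a short post-processing sketch (the column names are my own choice):

# rename the unnamed columns and convert epoch milliseconds to datetimes
df.columns = ['timestamp_ms', 'mid']
df['time'] = pd.to_datetime(df['timestamp_ms'], unit='ms')
print(df.head())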

There are several API endpoints being accessed to bring information into that page (all plain GET requests, nothing fancy), each returning detailed JSON data. You could have a look at each of them and see if you can get your information from there.
