How to extract a specific table from an url via Python?-CodePudding

I am currently looking at the following Link:

https://www.ishares.com/de/privatanleger/de/produkte/251931/ishares-stoxx-europe-600-ucits-etf-de-fund

There is a table that displays all Positions of the ETF. My goal is to extract the table and save it to a xlsx file. I wrote a code:

import requests
import pandas as pd

url = 'https://www.ishares.com/de/privatanleger/de/produkte/251931/ishares-stoxx-europe-600-ucits-etf-de-fund'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_excel('my data.xlsx')

However, when I try pd.read_html(html) it show me always that no tables could have been found on the website. Does somebody know how to identify and pull the desired table via Python?

CodePudding user response：

The problem is that the website uses cookies, so, when you use the default link, it redirects you to a first page that needs you to click in a button to accept cookies. And then you go to the right page. I encountered the right URL, that goes straight to the main page that you are looking for, try this:

url = 'https://www.ishares.com/de/privatanleger/de/produkte/251931/ishares-stoxx-europe-600-ucits-etf-de-fund?switchLocale=y&siteEntryPassthrough=true'

html = requests.get(url).content
df_list = pd.read_html(html)
print(df_list)

Here is my Output:

[Empty DataFrame
Columns: [Ex-Tag, Fälligkeitsdatum, Gesamtausschüttung]
Index: [], Empty DataFrame
Columns: [Ex-Tag, Fälligkeitsdatum, Gesamtausschüttung]
Index: [],             Unnamed: 0  2012  2013  2014  2015  ...  2017  2018  2019  2020  2021
0    Gesamtrendite (%)   177   210    74   108  ...   108  -110   276   -19   251
1  Vergleichsindex (%)   178   212    72    96  ...   106  -108   268   -20   249

[2 rows x 11 columns],                             Unnamed: 0  ...  Von 31.Mär.2021Bis 31.Mär.2022
0    Gesamtrendite (%) Per 31.Mär.2022  ...                             863
1  Vergleichsindex (%) Per 31.Mär.2022  ...                             849

[2 rows x 6 columns],             Unnamed: 0   1J   3J   5J  10J  Seit Auflage
0    Gesamtrendite (%)  197  894  538  940           651
1  Vergleichsindex (%)  178  879  522  920           636,             Unnamed: 0  Seit 1.1.  1M  3M  ...    3J    5J    10J  Seit Auflage
0    Gesamtrendite (%)       -736 -80 -43  ...  2929  2994  14548         21694
1  Vergleichsindex (%)       -755 -92 -65  ...  2874  2896  14119         20904

[2 rows x 10 columns], Empty DataFrame
Columns: [Emittententicker, Name, Sektor, Anlageklasse, Marktwert, Gewichtung (%), Nominalwert, Nominale, ISIN, Kurs, Standort, Börse, Marktwährung]
Index: [], Empty DataFrame
Columns: [Kategorie, Fonds]
Index: [], Empty DataFrame
Columns: [Kategorie, Fonds]
Index: [],                        Börse Ticker  ... Common Code (EOC) iNAV ISIN
0                      Xetra   EXSA  ...        186 794 77         -
1  Bolsa Mexicana De Valores   EXSA  ...                 -         -
2             Borsa Italiana   EXSA  ...                 -         -
3         SIX Swiss Exchange   EXSA  ...                 -         -

[4 rows x 14 columns]]

Process finished with exit code 0

CodePudding user response：

This is how you would use Selenium:

from Selenium import webdriver  
import time  
from Selenium.webdriver.common.keys import Keys  
driver = webdriver.Chrome()  
#maximize the window size  
driver.maximize_window()  
#navigate to the url  
driver.get("https://www.google.com/")