Unable to scrape table content with MIME type data:application/octet-stream using Python

I am trying to scrape some data from a website, but the data is contained in an iframe. I first extracted the iframe's source link, but I cannot scrape the data from that source page either. I need help extracting the data from this source link: https://chartviewer-europublic.bigapis.net/nzgaV/index.html

I am also sharing a screenshot that shows the download button's URL for the data, which sits under an "a" tag, but I am not able to extract this link either.

[Screenshot: the download button's "a" tag]

Here is the code I used, with BeautifulSoup for the scraping.

# Libraries
from bs4 import BeautifulSoup
import requests

# Original website link
url_spain_total = "https://anfac.com/cifras-clave/matriculaciones-turismos-y-todoterreno/"

page_total = requests.get(url_spain_total).text

soup_spain_total = BeautifulSoup(page_total, "lxml")

print(soup_spain_total.prettify())

# Getting the list of iframes on the page
result_spain = soup_spain_total.find_all("iframe")
result_spain

# Getting the required source link
total_main_link = result_spain[1]["src"]
total_main_link

After getting the source link, I am not able to extract the table contents.

Any help is appreciated. Thanks in advance!

CodePudding user response:

The following is an example of how you can get that data using Selenium:

from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
# chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-notifications")
chrome_options.add_argument("--window-size=1920,1080")

# Path to where you saved the chromedriver binary
webdriver_service = Service("chromedriver/chromedriver")
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(browser, 20)

url = "https://chartviewer-europublic.bigapis.net/nzgaV/index.html"
browser.get(url)

# Wait (up to 20 seconds) until the table element is visible and enabled
table = wait.until(EC.element_to_be_clickable((By.ID, "datatable")))

# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(table.get_attribute("outerHTML")))[0]
print(df)

browser.quit()

This reads the table into a DataFrame and prints it to the terminal:

	Categoría	Ago-22	Ago-21	% Variacion	Acumulado 2022	Acumulado 2021	% Variacion Acumulado
0	Gasolina	22.3402	20.0702	11311.31	231.348	279.89	-17-17.34
1	Diesel	8.9639	8.06481	11211.15	92.9799	119.641	-22-22.29
2	Resto	20.6042	19.4492	595.94	208.715	188.782	1110.56
3	Total combustibles	51.9075	47.5835	919.09	533.043	588.314	-9-9.39
4	Particular	24.9512	26.0833	-4,3-4.34	233.413	236.728	-1-1.4
5	Empresa	21.7122	17.6732	22922.85	224.337	215.654	44.03
6	Alquiler	5.24452	3.82738	37037.03	75.2928	135.931	-45-44.61
7	Total canales	51.9075	47.5835	919.09	533.043	588.314	-9-9.39
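
In the output above, the two "% Variacion" columns appear to contain two values run together in each cell (for example -4,3-4.34), so they are not usable as numbers. One way around this, sketched below, is to recompute the variation from the numeric columns instead of parsing those cells. The pct_change helper is a hypothetical name, the two sample rows are copied from the printed output, and the sketch assumes the monthly columns were parsed as plain numbers.

```python
def pct_change(current: float, previous: float) -> float:
    """Percentage change from previous to current, rounded to 2 decimals."""
    return round((current / previous - 1) * 100, 2)


# Two rows copied from the printed DataFrame: (Categoría, Ago-22, Ago-21)
rows = [
    ("Gasolina", 22.3402, 20.0702),
    ("Diesel", 8.9639, 8.06481),
]

# Recompute the monthly variation for each row
recomputed = {name: pct_change(cur, prev) for name, cur, prev in rows}
for name, value in recomputed.items():
    print(name, value)
```

For these two rows the results agree with the trailing figures in the printed cells (11.31 and 11.15). On the DataFrame itself, the same arithmetic would be (df["Ago-22"] / df["Ago-21"] - 1) * 100, assuming the column names shown in the output.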

The Selenium setup above is for Linux. If you are on Windows or macOS, you can find many examples of Selenium/chromedriver setups for those platforms among the Selenium questions on this forum.
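
For what it is worth, Selenium 4.6 and later ship with Selenium Manager, which can locate or download a matching driver on its own, so an explicit chromedriver path may not be needed on any platform. Here is a minimal sketch under that assumption. Both chromedriver_name and make_browser are hypothetical helper names, not part of Selenium.

```python
import platform


def chromedriver_name() -> str:
    """Return the usual chromedriver file name for the current OS."""
    # On Windows the binary carries an .exe suffix; Linux and macOS do not.
    return "chromedriver.exe" if platform.system() == "Windows" else "chromedriver"


def make_browser(driver_path=None):
    """Start Chrome; with no path given, let Selenium resolve the driver."""
    # Imported here so chromedriver_name() works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.add_argument("--window-size=1920,1080")

    if driver_path:
        # Explicit binary, as in the Linux setup above
        return webdriver.Chrome(service=Service(driver_path), options=options)
    # No path: Selenium Manager (Selenium >= 4.6) looks up the driver itself
    return webdriver.Chrome(options=options)
```

Calling make_browser() relies on Selenium Manager, while make_browser("chromedriver/" + chromedriver_name()) keeps the explicit-path approach and only adapts the file name to the operating system.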

The Selenium documentation is also helpful: https://www.selenium.dev/documentation/webdriver/getting_started/
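
As for the download link mentioned in the question: an href that starts with data:application/octet-stream is a data URI, meaning the file contents are embedded in the link itself and there is no separate address to request. If that is what the button's "a" tag holds, the payload could be decoded directly once Selenium has read the href. This is a sketch under that assumption. The decode_data_uri helper is a hypothetical name, the sample URI is made up, and the a[download] selector in the comment is only a guess at the page's markup.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_uri(uri: str) -> bytes:
    """Decode the payload of a data: URI into raw bytes."""
    if not uri.startswith("data:"):
        raise ValueError("not a data: URI")
    # Everything before the first comma is the media type and flags
    header, _, payload = uri[len("data:"):].partition(",")
    raw = unquote_to_bytes(payload)  # undo percent-encoding
    if header.endswith(";base64"):
        return base64.b64decode(raw)
    return raw


# With Selenium, the href might be read like this (selector is an assumption):
# href = browser.find_element(By.CSS_SELECTOR, "a[download]").get_attribute("href")

# Made-up sample standing in for the real href
sample = "data:application/octet-stream;charset=utf-8,a%2Cb%0A1%2C2"
print(decode_data_uri(sample).decode("utf-8"))
```

If the decoded bytes turn out to be CSV text, they could then be passed to pandas through io.BytesIO, which would avoid parsing the rendered HTML table.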