Unable to scrape table content with MIME type data:application/octet-stream using Python

Time:09-10

I am trying to scrape data from a website, but the data is contained in an iframe. I first extracted the iframe's source link, but I am not able to scrape the data from that page either. How can I extract the data from this source link? Here it is: https://chartviewer-europublic.bigapis.net/nzgaV/index.html

I am also sharing a screenshot showing that the URL of the data's download button sits in an "a" tag, but I am not able to extract this link either.

[Screenshot: the download button's URL inside an "a" tag]

Here is the code I used, with BeautifulSoup for the scraping.

# Libraries

from bs4 import BeautifulSoup
import requests

# Original website link
url_spain_total="https://anfac.com/cifras-clave/matriculaciones-turismos-y-todoterreno/"

page_total=requests.get(url_spain_total).text

soup_spain_total=BeautifulSoup(page_total,"lxml")

print(soup_spain_total.prettify())

# Getting the list of links in the iframe
result_spain=soup_spain_total.find_all("iframe")
result_spain

# Getting the required source link
total_main_link=result_spain[1]["src"]
total_main_link
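
A side note on the iframe lookup: result_spain[1] depends on the iframes keeping their order on the page. Below is a minimal sketch of selecting the iframe by a substring of its src instead. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages. The helper names (IframeCollector, find_iframe_sources), the sample HTML, and the example.com URL are assumptions for illustration, not taken from the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeCollector(HTMLParser):
    """Collect the src attribute of every <iframe> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_iframe_sources(html, base_url, must_contain=""):
    """Return absolute iframe URLs whose text contains must_contain."""
    parser = IframeCollector()
    parser.feed(html)
    absolute = [urljoin(base_url, src) for src in parser.sources]
    return [src for src in absolute if must_contain in src]


# Made-up page body standing in for requests.get(url_spain_total).text
sample_html = """
<html><body>
  <iframe src="https://example.com/ads"></iframe>
  <iframe src="https://chartviewer-europublic.bigapis.net/nzgaV/index.html"></iframe>
  <iframe src="/local/widget.html"></iframe>
</body></html>
"""

links = find_iframe_sources(
    sample_html,
    "https://anfac.com/cifras-clave/matriculaciones-turismos-y-todoterreno/",
    must_contain="chartviewer",
)
print(links)
```

Filtering on "chartviewer" keeps working if the page adds or reorders iframes, which a fixed index does not.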

After getting the source link, I am not able to extract the table contents.

Any help is appreciated. Thanks in advance!

Answer:

Here is an example of how to get that data with Selenium, which loads the page in a real browser before reading the table:

from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
# chrome_options.add_argument("--headless")  # uncomment to run without a visible window
chrome_options.add_argument("--disable-notifications")
chrome_options.add_argument("--window-size=1920,1080")

# Path to where you saved the chromedriver binary
webdriver_service = Service("chromedriver/chromedriver")
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(browser, 20)

# Note: no leading space in the URL
url = "https://chartviewer-europublic.bigapis.net/nzgaV/index.html"
browser.get(url)

# Wait up to 20 seconds for the element with id "datatable" to be ready
table = wait.until(EC.element_to_be_clickable((By.ID, "datatable")))

# Wrap the HTML in StringIO: recent pandas versions deprecate passing a raw HTML string
df = pd.read_html(StringIO(table.get_attribute("outerHTML")))[0]
print(df)

browser.quit()

This reads the table into a DataFrame and prints it to the terminal:

            Categoría   Ago-22   Ago-21  % Variacion  Acumulado 2022  Acumulado 2021  % Variacion Acumulado
0            Gasolina  22.3402  20.0702     11311.31         231.348          279.89              -17-17.34
1              Diesel   8.9639  8.06481     11211.15         92.9799         119.641              -22-22.29
2               Resto  20.6042  19.4492       595.94         208.715         188.782                1110.56
3  Total combustibles  51.9075  47.5835       919.09         533.043         588.314                -9-9.39
4          Particular  24.9512  26.0833    -4,3-4.34         233.413         236.728                 -1-1.4
5             Empresa  21.7122  17.6732     22922.85         224.337         215.654                  44.03
6            Alquiler  5.24452  3.82738     37037.03         75.2928         135.931              -45-44.61
7       Total canales  51.9075  47.5835       919.09         533.043         588.314                -9-9.39

The Selenium setup above is for Linux. If you are on Windows or Mac, the Selenium questions on this forum include many examples of Selenium/chromedriver setups for those systems.

Also, Selenium documentation is helpful: https://www.selenium.dev/documentation/webdriver/getting_started/
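
On the download link from the question: if its href really starts with data:application/octet-stream, as the title suggests, the file is not hosted at a separate address. The content is embedded in the URL itself, so there is nothing to request and you only need to decode the href. Below is a minimal sketch, assuming you already obtained the href from the rendered page (for example with Selenium's get_attribute("href") on the "a" element). The function name decode_data_uri and the sample href with its placeholder values are made up for illustration.

```python
import base64
import csv
import io
from urllib.parse import unquote_to_bytes


def decode_data_uri(uri: str) -> bytes:
    """Return the raw bytes embedded in a data: URI (hypothetical helper)."""
    if not uri.startswith("data:"):
        raise ValueError("not a data: URI")
    # Everything before the first comma is metadata, everything after is payload
    header, _, payload = uri[len("data:"):].partition(",")
    if header.endswith(";base64"):
        return base64.b64decode(payload)
    # Otherwise the payload is percent-encoded text
    return unquote_to_bytes(payload)


# Placeholder href; the real value would come from the download link on the page
sample_href = (
    "data:application/octet-stream;charset=utf-8,"
    "col_a%2Ccol_b%0A1%2C2%0A3%2C4"
)

text = decode_data_uri(sample_href).decode("utf-8")
rows = list(csv.reader(io.StringIO(text)))
print(rows)
```

This assumes the embedded file is UTF-8 CSV. If the real link carries another delimiter or encoding, the decode and csv.reader arguments would need adjusting.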
