How to scrape data from Highcharts using Python

I am trying to scrape data from the chart at https://www.transfermarkt.com/neymar/marktwertverlauf/spieler/68290. I tried to read the values shown in the tooltip boxes via their XPaths, but that does not seem to work.

I tried using Scrapy:

date = response.xpath('//*[@id="highcharts-0"]/div/span/b[1]').get()
market_value = response.xpath('//*[@id="highcharts-0"]/div/span/b[2]').get()
club = response.xpath('//*[@id="highcharts-0"]/div/span/b[3]').get()
age = response.xpath('//*[@id="highcharts-0"]/div/span/b[4]').get()

How can I scrape all the data from the chart using Scrapy or Selenium?

CodePudding user response:

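The chart is rendered on the client (in the browser) from an inline JavaScript block in the HTML body, so the tooltip elements your XPaths point at are not present in the raw HTML that Scrapy downloads.

If you want to stay with Scrapy, you need a regular expression to pull the series data out of that inline script.

For example (not tested):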

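import re
import json

# Meant to run inside a Scrapy callback, e.g. parse(self, response).
# response.text is the decoded HTML; response.body is raw bytes, not a method.
body = response.text
# Capture everything after 'series': up to the end of the series array.
data = re.findall(r"(?<='series':).*?}}]}]", body)

if not data:
    return None

# This only works if the captured text is valid JSON.
data = json.loads(data[0])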
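
To try the regex idea offline before pointing it at the real page, something like the following sketch may help. The sample HTML, the helper name extract_series and the shape of the series array are all assumptions made for illustration; the real inline script may use single quotes or JavaScript escapes that json.loads rejects, in which case the captured text would need cleaning first.

```python
import json
import re

# Hypothetical stand-in for the page source; the real inline script will differ.
SAMPLE_HTML = """
<html><body><script>
new Highcharts.Chart({'chart': {'type': 'line'},
'series': [{"name": "Market value", "data": [{"y": 100, "marker": {"symbol": "circle"}}, {"y": 250, "marker": {"symbol": "circle"}}]}],
'credits': {'enabled': false}});
</script></body></html>
"""

# Same pattern as in the answer: start right after 'series': and stop lazily
# at the first "}}]}]", which closes the last point and the series array.
SERIES_RE = re.compile(r"(?<='series':).*?}}]}]")


def extract_series(html):
    """Return the parsed series list, or None if the pattern does not match."""
    matches = SERIES_RE.findall(html)
    if not matches:
        return None
    return json.loads(matches[0])


series = extract_series(SAMPLE_HTML)
points = series[0]["data"]
print([p["y"] for p in points])
```

Because the pattern does not cross line breaks, it relies on the whole series array sitting on one line of the script, which is an assumption about the page.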

CodePudding user response:

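import time
from webdriver_manager.chrome import ChromeDriverManager
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Chart page from the question
url = "https://www.transfermarkt.com/neymar/marktwertverlauf/spieler/68290"

chrome_options = Options()
chrome_options.add_argument("--headless")

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                          options=chrome_options)
driver.get(url)
time.sleep(5)  # crude wait for the chart to finish rendering

# Read the points of the first series straight from the Highcharts object
data = driver.execute_script('return window.Highcharts.charts[0]'
                             '.series[0].options.data')
driver.quit()
print(data)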
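
Once data holds the list of points, it is usually handy to flatten it into rows with only the fields asked about (date, market value, club, age). The sketch below assumes each point is a dict and that the keys are called datum_mw, mw, verein and age; those key names, the sample values and both helper names are assumptions, so check them against what print(data) actually shows.

```python
import csv
import io

# Hypothetical sample shaped like the list returned by execute_script.
# Key names and values are placeholders, not real chart data.
sample_points = [
    {"y": 1000000, "datum_mw": "Jan 1, 2010", "mw": "1.00m", "verein": "Club A", "age": "17"},
    {"y": 5000000, "datum_mw": "Jan 1, 2011", "mw": "5.00m", "verein": "Club A"},
]

# Output column name -> assumed key in each point dict.
FIELDS = {"date": "datum_mw", "market_value": "mw", "club": "verein", "age": "age"}


def points_to_rows(points):
    """Map each point dict to a flat row; missing keys become None."""
    return [{col: p.get(key) for col, key in FIELDS.items()} for p in points]


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(FIELDS))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = points_to_rows(sample_points)
print(rows_to_csv(rows))
```

Using p.get(key) rather than p[key] keeps one incomplete point from aborting the whole export.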