I'm trying to scrape data from a High Charts table from the following URL:
https://www.pricecharting.com/game/pal-nes/legend-of-zelda
The following code will grab the data from the default chart data labelled (Loose), but I need to extract data from the other items listed on the chart. (CIB, New, Graded, Boxed, Manual). I'm really stuck and not sure how to get the information from the other charts.
Here's my code which works for 'Loose' only (The default option).
import time
import pandas as pd
from selenium.webdriver.chrome.service import Service
from selenium import webdriver
service = Service(executable_path="/driver_selenium/geckodriver.exe")
driver = webdriver.Firefox(service=service)
website = "https://www.pricecharting.com/game/pal-nes/legend-of-zelda#completed-auctions-graded"
driver.get(website)
time.sleep(5)
temp = driver.execute_script('return window.Highcharts.charts[0]'
'.series[0].options.data')
data = [item[1] for item in temp]
print(data)
Bonus points if there's a better way to extract the data which does not use Selenium, something similar to the answer here: Scrape highchart into python
CodePudding user response:
EDIT: We can try the following:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import json
url = 'https://www.pricecharting.com/game/pal-nes/legend-of-zelda'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
data_script = soup.find('script', string=re.compile("VGPC.chart_data = {"))
# print(data_script.text.split('VGPC.chart_data = ')[1].split(' VGPC.product = {')[0].split(';')[0].strip())
data = json.loads(data_script.text.split('VGPC.chart_data = ')[1].split(' VGPC.product = {')[0].split(';')[0].strip())
df = pd.DataFrame(data)
print(df)
Which returns:
boxonly cib graded manualonly new used
0 [1498888800000, 0] [1498888800000, 17004] [1498888800000, 0] [1498888800000, 0] [1498888800000, 0] [1498888800000, 6510]
1 [1501567200000, 0] [1501567200000, 7878] [1501567200000, 0] [1501567200000, 0] [1501567200000, 23100] [1501567200000, 5409]
2 [1504245600000, 0] [1504245600000, 8879] [1504245600000, 0] [1504245600000, 0] [1504245600000, 23100] [1504245600000, 5169]
3 [1506837600000, 0] [1506837600000, 8432] [1506837600000, 0] [1506837600000, 0] [1506837600000, 37665] [1506837600000, 4499]
4 [1509516000000, 0] [1509516000000, 9286] [1509516000000, 0] [1509516000000, 0] [1509516000000, 37665] [1509516000000, 4513]
... ... ... ... ... ... ...
This dataframe contains 6 columns with historical prices, which are being used in that Highchart graph (prices appears to be in cents). This method avoids the overheads of selenium/chromedriver.
We can now look at individual line charts, for example at 'new':
df_new = pd.DataFrame(data['new'], columns = ['Date_time', 'Price'])
df_new['Date_time'] = pd.to_datetime(df_new['Date_time'], unit="ms")
print(df_new)
And the result is:
Date_time Price
0 2017-07-01 06:00:00 0
1 2017-08-01 06:00:00 23100
2 2017-09-01 06:00:00 23100
3 2017-10-01 06:00:00 37665
4 2017-11-01 06:00:00 37665
... ... ...