I have a problem with web scraping. I need to scrape data from a betting site and store it in a DataFrame.
My code:
import numpy as np
import pandas as pd
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
DRIVER_PATH = 'C:\\executables\\chromedriver.exe'
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get("https://www.nike.sk/live-stavky/futbal")
time.sleep(10)
soup = BeautifulSoup(driver.page_source, 'html.parser')
# match time
out_1 = soup.find_all(class_='ellipsis flex fs-10 c-black-50 justify-between pr-5')
# home and away teams
out_2 = soup.find_all(class_='ellipsis f-condensed c-black-100 text-extra-bold match-opponents pr-10')
# match status
out_3 = soup.find_all(class_='flex justify-center text-right flex-col match-score-col fs-12 c-orange text-extra-bold')
# match status 2
out_4 = soup.find_all(class_='flex justify-center text-right flex-col match-score-col fs-12 text-extra-bold c-default-light')
My output (out_1, ..., out_4) is messy blocks of text. How can I put it into a single DataFrame? Can I convert it to a DataFrame without regex?
CodePudding user response:
You can use the site's Ajax API to download the data in JSON format, then build a DataFrame from it:
import json
import re
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://push.nike.sk/snapshot?path=/n1/overview/futbal/tournaments/"
html_doc = requests.get("https://www.nike.sk/live-stavky/futbal").text
token = re.search(r'"securityToken":"([^"]+)"', html_doc).group(1)
data = json.loads(requests.get(url, headers={"x-security-token": token}).json()[0][-1])
all_data = []
for m in data["matches"]:
    s1 = m["score"]["scores"]["TOTAL"]["home"]
    s2 = m["score"]["scores"]["TOTAL"]["away"]
    all_data.append((m["home"]["en"], m["away"]["en"], s1, s2))
df = pd.DataFrame(all_data, columns=["Team 1", "Team 2", "Score 1", "Score 2"])
print(df)
Prints:
                        Team 1             Team 2  Score 1  Score 2
0                Barito Putera           Makassar        1        1
1               Rahmatgonj MFS       Sheikh Jamal        2        0
2  Stredoafrická republika SRL        Etiópia SRL        2        1
3                   Kosovo SRL       Arménsko SRL        0        0
4             Mohammedan Dhaka  Azampur FC Uttara        3        0
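If a match has no score yet, the chained dictionary lookups in the loop above would raise a KeyError. Here is a minimal sketch of one alternative, assuming the payload keeps the nested shape used in the code above (home/en, away/en, score/scores/TOTAL): pandas.json_normalize flattens the nested dictionaries into dotted column names, and missing fields become NaN. The sample_matches list and the helper name matches_to_frame are made up for illustration; they are not real data from the site.

```python
import pandas as pd

# Hypothetical payload fragment, shaped like data["matches"] in the code above.
sample_matches = [
    {"home": {"en": "Team A"}, "away": {"en": "Team B"},
     "score": {"scores": {"TOTAL": {"home": 1, "away": 1}}}},
    {"home": {"en": "Team C"}, "away": {"en": "Team D"},
     "score": {"scores": {"TOTAL": {"home": 2, "away": 0}}}},
    # A match that has no score information yet.
    {"home": {"en": "Team E"}, "away": {"en": "Team F"}},
]

# Flattened (dotted) column names mapped to the final column labels.
COLUMNS = {
    "home.en": "Team 1",
    "away.en": "Team 2",
    "score.scores.TOTAL.home": "Score 1",
    "score.scores.TOTAL.away": "Score 2",
}


def matches_to_frame(matches):
    """Flatten a list of nested match dicts into a four-column DataFrame."""
    flat = pd.json_normalize(matches)            # nested keys -> "a.b.c" columns
    flat = flat.reindex(columns=list(COLUMNS))   # keep wanted columns, NaN if absent
    return flat.rename(columns=COLUMNS)


df = matches_to_frame(sample_matches)
print(df)
```

With the real response, the call would be matches_to_frame(data["matches"]). Note that a column containing NaN is stored as float, so scores may print as 1.0 rather than 1.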