BeautifulSoup output to dataframe in Python

Time:01-10

I have a problem with web scraping. I need to scrape data from a betting site and store it in a dataframe.

My code:

import numpy as np
import pandas as pd
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

DRIVER_PATH = 'C:\\executables\\chromedriver.exe'

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)

driver.get("https://www.nike.sk/live-stavky/futbal")

time.sleep(10)


soup = BeautifulSoup(driver.page_source, 'html.parser')

# match time
out_1 = soup.find_all(class_='ellipsis flex fs-10 c-black-50 justify-between pr-5')
# home and away teams
out_2 = soup.find_all(class_='ellipsis f-condensed c-black-100 text-extra-bold match-opponents pr-10')
# match status
out_3 = soup.find_all(class_='flex justify-center text-right flex-col match-score-col fs-12 c-orange text-extra-bold')
# match status 2
out_4 = soup.find_all(class_='flex justify-center text-right flex-col match-score-col fs-12 text-extra-bold c-default-light')

My output (out_1, ..., out_4) is messy blocks of text. How can I put it into a single dataframe? Can I convert it to a dataframe without regex?
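
Each item returned by find_all is a Tag object rather than a string, so the text can usually be pulled out with get_text() instead of regex. Below is a minimal sketch of that idea, not a tested solution for the live page: the HTML snippet, the "match" container class and the "Home - Away" text layout are made up for illustration, and only the match-opponents and match-score-col class names come from the code above.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source; the real markup may differ.
html = """
<div class="match">
  <div class="match-opponents">Barito Putera - Makassar</div>
  <div class="match-score-col">1:1</div>
</div>
<div class="match">
  <div class="match-opponents">Rahmatgonj MFS - Sheikh Jamal</div>
  <div class="match-score-col">2:0</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# Walk one match container at a time so the fields stay aligned per match.
# "match" is an assumed container class, not taken from the real site.
for match in soup.find_all(class_="match"):
    opponents = match.find(class_="match-opponents")
    score = match.find(class_="match-score-col")
    rows.append({
        # get_text(strip=True) returns the visible text without surrounding whitespace.
        "opponents": opponents.get_text(strip=True) if opponents else None,
        "score": score.get_text(strip=True) if score else None,
    })

df = pd.DataFrame(rows)
print(df)
```

Collecting the fields per match container, rather than from four separate find_all lists, avoids misaligned rows when one match is missing a field.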

CodePudding user response:

You can try using their Ajax API to download the data in JSON format, then build a dataframe from it:

import json
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://push.nike.sk/snapshot?path=/n1/overview/futbal/tournaments/"

html_doc = requests.get("https://www.nike.sk/live-stavky/futbal").text

token = re.search(r'"securityToken":"([^"]+)"', html_doc).group(1)


data = json.loads(requests.get(url, headers={"x-security-token": token}).json()[0][-1])

all_data = []
for m in data["matches"]:
    s1 = m["score"]["scores"]["TOTAL"]["home"]
    s2 = m["score"]["scores"]["TOTAL"]["away"]
    all_data.append((m["home"]["en"], m["away"]["en"], s1, s2))

df = pd.DataFrame(all_data, columns=["Team 1", "Team 2", "Score 1", "Score 2"])
print(df)

Prints:

                        Team 1             Team 2 Score 1 Score 2
0                Barito Putera           Makassar       1       1
1               Rahmatgonj MFS       Sheikh Jamal       2       0
2  Stredoafrická republika SRL        Etiópia SRL       2       1
3                   Kosovo SRL       Arménsko SRL       0       0
4             Mohammedan Dhaka  Azampur FC Uttara       3       0
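
As an alternative to the manual loop above, pandas.json_normalize can flatten the nested match records in one call. This is only a sketch under assumptions: the sample dictionary below is hand-written to mirror the keys used in the loop (home.en, away.en, score.scores.TOTAL), and the live response may contain other or missing fields.

```python
import pandas as pd

# Hypothetical sample shaped like the "matches" list used in the loop above;
# it is not a real API response.
data = {
    "matches": [
        {"home": {"en": "Barito Putera"}, "away": {"en": "Makassar"},
         "score": {"scores": {"TOTAL": {"home": 1, "away": 1}}}},
        {"home": {"en": "Rahmatgonj MFS"}, "away": {"en": "Sheikh Jamal"},
         "score": {"scores": {"TOTAL": {"home": 2, "away": 0}}}},
    ]
}

# json_normalize flattens nested dicts into dotted column names,
# e.g. "score.scores.TOTAL.home".
flat = pd.json_normalize(data["matches"])

# Keep only the columns of interest and give them readable names.
column_names = {
    "home.en": "Team 1",
    "away.en": "Team 2",
    "score.scores.TOTAL.home": "Score 1",
    "score.scores.TOTAL.away": "Score 2",
}
df = flat[list(column_names)].rename(columns=column_names)
print(df)
```

Selecting columns by name raises a KeyError if the API stops returning one of those fields, which makes a changed response easy to spot.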