Pandas Scrape switching values on scrape

Time:08-17

import requests
import pandas as pd
import csv


headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

urls = [
    'https://stats.ncaa.org/game/play_by_play/12465',
    'https://stats.ncaa.org/game/play_by_play/12755',
    'https://stats.ncaa.org/game/play_by_play/12640',
    'https://stats.ncaa.org/game/play_by_play/12290',
]
s = requests.Session()
s.headers.update(headers)
for url in urls:
    r = s.get(url)
    dfs = pd.read_html(r.text)
    for df in dfs:
        df.to_csv('pbp.csv', mode='a', index=False)

I have this amazing script I've struggled through (with the help of this community). It pulls the table data from these different sites, and pushes to a CSV.

For some reason, the "Score" column loses its format in some places (image below). I've tried reformatting it in Excel/Sheets, but no luck.

I don't see anything different in Inspector with these lines -- it's not all of them, only a select few on each link.

Any way I could prevent this from happening? Doesn't make much sense to me.

[Image: Incorrect Score]

CodePudding user response:

Seems like it's converting the scores to dates, no? Maybe you can cast that column to string?

dfs = pd.read_html(r.text, converters={'Score': lambda x: str(x)})

I'd try that...
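
One way to check where the conversion happens is to look at the raw CSV text rather than the spreadsheet view. The sketch below uses a made-up table (the rows and names are hypothetical, not taken from stats.ncaa.org) and assumes the scraped tables have no header row, so pandas labels the columns 0 to 3. If the score still reads 1-0 in the raw text, the conversion is probably happening when Excel or Sheets opens the file, not in pandas.

```python
import io

import pandas as pd

# Hypothetical stand-in for one scraped play-by-play table. There is no
# real header row, so the column labels are 0..3 and row 0 holds the titles.
sample = pd.DataFrame([
    ["Time", "Team A", "Score", "Team B"],
    ["05:12", "Goal by Smith", "1-0", ""],
    ["17:40", "", "1-1", "Goal by Jones"],
])

# Write to an in-memory buffer instead of a file so the raw text can be inspected.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Opening the real pbp.csv in a plain text editor gives the same kind of check on the actual data.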

CodePudding user response:

Thinking about it further, you can keep the values as integers, and avoid having them mistaken for dates, by splitting the column in two. I also wrote each game to its own file.

import requests
import pandas as pd
import csv


headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

urls = [
    'https://stats.ncaa.org/game/play_by_play/12465',
    'https://stats.ncaa.org/game/play_by_play/12755',
    'https://stats.ncaa.org/game/play_by_play/12640',
    'https://stats.ncaa.org/game/play_by_play/12290',
]
s = requests.Session()
s.headers.update(headers)
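for url in urls:
    # Use the trailing game ID from the URL to name the output file.
    gameId = url.split('/')[-1]
    r = s.get(url)
    dfs = pd.read_html(r.text)
    for df in dfs:
        # Only tables with 'Score' as the third header cell need the split.
        if len(df.columns) > 2:
            if df.iloc[0,2] == 'Score':
                # Copy the last column one place to the right, then split
                # a score such as '1-0' across columns 2 and 3.
                df[4] = df[3]
                df[[2,3]] = df[2].str.split('-', expand=True)

        df.to_csv('pbp_%s.csv' %gameId, mode='a', index=False)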
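
To try the split without hitting the site, here is a minimal sketch on a made-up table. The helper name split_score, the column name score_right, and the sample rows are all hypothetical, and it assumes the score sits in the column labelled 2, as in the code above. It is one way to do the same split, not the only one.

```python
import pandas as pd


def split_score(df, score_col=2):
    """Return a copy with a 'left-right' score column split into two columns."""
    out = df.copy()
    # Split on the first dash only; cells without a dash get a missing right half.
    parts = out[score_col].astype(str).str.split("-", n=1, expand=True)
    # Guarantee two result columns even if no cell contained a dash.
    parts = parts.reindex(columns=[0, 1])
    # Put the right half directly after the score column (hypothetical name).
    out.insert(out.columns.get_loc(score_col) + 1, "score_right", parts[1])
    out[score_col] = parts[0]
    return out


# Hypothetical rows shaped like a scraped table with the titles in row 0.
sample = pd.DataFrame([
    ["Time", "Team A", "Score", "Team B"],
    ["05:12", "Goal by Smith", "1-0", ""],
    ["17:40", "", "1-1", "Goal by Jones"],
])

result = split_score(sample)
print(result)
```

Because each half of the score now sits in its own cell, a spreadsheet has nothing that looks like a date to reinterpret.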

