Printing website scrape loop directly into Google Sheets


I'm scraping a bunch of website data and am able to print it to the terminal, but I'm having real trouble pushing it directly to a sheet.

I can confirm the gspread connection is working: my code loops through the names, but it never touches the sheet.

I thought the last line of code (append) would place the results.

Ultimately, I just want the full result of my code in columns on the sheet, but I'm having trouble.

If there's a better approach, I'd love any guidance:

https://docs.google.com/spreadsheets/d/1TD4YmhfAsnSL_Fwo1lckEbnUVBQB6VyKC05ieJ7PKCw/edit#gid=0

import requests
from bs4 import BeautifulSoup

import gspread
gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('1TD4YmhfAsnSL_Fwo1lckEbnUVBQB6VyKC05ieJ7PKCw')
worksheet = sh.sheet1
# AddValue = ["Test", 25, "Test2"]
# worksheet.insert_row(AddValue, 3)


def get_links(url):
    data = []
    req_url = requests.get(url)
    soup = BeautifulSoup(req_url.content, "html.parser")

    for td in soup.find_all('td', {'data-th': 'Player'}):
        a_tag = td.a
        name = a_tag.text
        player_url = a_tag['href']
        print(f"Getting {name}")

        req_player_url = requests.get(
            f"https://basketball.realgm.com{player_url}")
        soup_player = BeautifulSoup(req_player_url.content, "html.parser")
        div_profile_box = soup_player.find("div", class_="profile-box")
        row = {"Name": name, "URL": player_url}

        for p in div_profile_box.find_all("p"):
            try:
                key, value = p.get_text(strip=True).split(':', 1)
                row[key.strip()] = value.strip()
            except ValueError:     # not all entries have a "key: value" format
                pass

        data.append(row)

    return data


urls = [
    'https://basketball.realgm.com/dleague/players/2022',
]


for url in urls:
    print(f"Getting: {url}")
    data = get_links(url)

    for entry in data:
        worksheet.append_row(entry)

CodePudding user response:

Modification points:

  • Your get_links(url) returns a list of dictionaries, but append_row expects a flat list of cell values, not a dictionary. I believe this is the reason for your current issue (see the sketch after this list).
  • In your script, append_row is called once per row inside the loop, so every row costs a separate Sheets API call, which is slow and can run into quota limits.
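
To make the first point concrete, here is a minimal sketch (the entry dict is a hypothetical single row of the kind get_links returns). append_row needs a flat list of cell values, so a dictionary has to be flattened first; per the second point, though, batching with append_rows as shown below is still preferable:

entry = {"Name": "Example Player", "URL": "/player/example"}  # hypothetical row
worksheet.append_row(list(entry.values()))  # append_row takes a flat list, not a dict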

When these points are reflected in your script, how about the following modification?

From:

for url in urls:
    print(f"Getting: {url}")
    data = get_links(url)

    for entry in data:
        worksheet.append_row(entry)

To:

res = []
for url in urls:
    print(f"Getting: {url}")
    data = get_links(url)
    res.extend(data)

if res:
    # Use the keys of the first row as the header, then flatten every
    # dictionary into a list of cell values in that column order.
    header = list(res[0].keys())
    values = [header, *[[e.get(k, "") for k in header] for e in res]]
    # One API call writes the header and all data rows at once.
    worksheet.append_rows(values, value_input_option="USER_ENTERED")

  • In this modification, all of the values are collected inside the for loop, and the retrieved values are then put into the spreadsheet with a single API call using append_rows.
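
One practical note: if the script is re-run, append_rows will add another header row and another copy of the data below whatever is already in the sheet. Assuming the goal is to overwrite the previous results on each run, the worksheet can be cleared first with gspread's Worksheet.clear():

worksheet.clear()  # remove previous contents so this run overwrites instead of appending
worksheet.append_rows(values, value_input_option="USER_ENTERED")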

Note:

  • In this modification, the header row is built with header = list(res[0].keys()). If you want a specific column order, replace that line with your own manually ordered list of header names (see the sketch below).
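
One further caveat, based on an assumption about the scraped data: not every player profile necessarily contains the same fields, so a header taken only from res[0] can silently drop columns that first appear in later rows. A minimal sketch that builds the header from the union of keys across all rows instead:

header = []
for e in res:
    for k in e:
        if k not in header:
            header.append(k)  # first-seen order, but every key is included
values = [header, *[[e.get(k, "") for k in header] for e in res]]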
