I'm scraping a bunch of website data and can print it to the terminal, but I'm having real trouble pushing it directly to a sheet.
I can confirm the gspread connection is working - my code loops through the names, but never touches the sheet.
I thought the last line of code (the append) would place the results.
Ultimately, I just want the full result of my code in columns on the sheet.
If there's a better approach, I'd love any guidance:
https://docs.google.com/spreadsheets/d/1TD4YmhfAsnSL_Fwo1lckEbnUVBQB6VyKC05ieJ7PKCw/edit#gid=0
import requests
from bs4 import BeautifulSoup
import gspread

gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('1TD4YmhfAsnSL_Fwo1lckEbnUVBQB6VyKC05ieJ7PKCw')
worksheet = sh.sheet1

# AddValue = ["Test", 25, "Test2"]
# worksheet.insert_row(AddValue, 3)

def get_links(url):
    data = []
    req_url = requests.get(url)
    soup = BeautifulSoup(req_url.content, "html.parser")
    for td in soup.find_all('td', {'data-th': 'Player'}):
        a_tag = td.a
        name = a_tag.text
        player_url = a_tag['href']
        print(f"Getting {name}")

        req_player_url = requests.get(
            f"https://basketball.realgm.com{player_url}")
        soup_player = BeautifulSoup(req_player_url.content, "html.parser")
        div_profile_box = soup_player.find("div", class_="profile-box")
        row = {"Name": name, "URL": player_url}
        for p in div_profile_box.find_all("p"):
            try:
                key, value = p.get_text(strip=True).split(':', 1)
                row[key.strip()] = value.strip()
            except:  # not all entries have values
                pass
        data.append(row)
    return data

urls = [
    'https://basketball.realgm.com/dleague/players/2022',
]

for url in urls:
    print(f"Getting: {url}")
    data = get_links(url)
    for entry in data:
        worksheet.append_row(entry)
CodePudding user response:
Modification points:
- When I saw your script, the data returned by data = get_links(url) is a list of dictionaries. append_row cannot directly use a dictionary; it expects a flat list of cell values, one per column. I believe this is the reason for your current issue. (A short sketch follows this list.)
- In your script, append_row is called inside the loop, so every row costs one API request. This is slow and can run into the Sheets API rate limits.
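For instance, a minimal sketch of the difference (the sample values are made up):

# append_row expects a flat list of cell values, one per column:
worksheet.append_row(["Some Player", "/player/Some-Player/Summary/1"])

# ...but get_links() returns dictionaries like this, which cannot be
# sent to the API as a row of cells directly:
entry = {"Name": "Some Player", "URL": "/player/Some-Player/Summary/1"}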
When these points are reflected in your script, how about the following modification?
From:

for url in urls:
    print(f"Getting: {url}")
    data = get_links(url)
    for entry in data:
        worksheet.append_row(entry)
To:

res = []
for url in urls:
    print(f"Getting: {url}")
    data = get_links(url)
    res.extend(data)

if res:
    # Build one 2D array: a header row, then one list of cells per player.
    # Missing keys become empty cells.
    header = list(res[0].keys())
    values = [header, *[[e.get(k, "") for k in header] for e in res]]
    worksheet.append_rows(values, value_input_option="USER_ENTERED")
- In this modification, all values are collected in the for loop first, and then written to the spreadsheet with a single API call to append_rows.
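As a side note, append_rows adds the values below whatever is already in the sheet, so rerunning the script stacks a second header and data set under the first. A minimal sketch of one way around that, assuming you want the sheet rebuilt from scratch on every run:

# Optional: wipe the existing contents first so reruns replace
# the data instead of appending a duplicate copy below it.
worksheet.clear()
worksheet.append_rows(values, value_input_option="USER_ENTERED")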
Note:
- This modification takes the header row from the first result via header = list(res[0].keys()), so any profile field that appears only in later rows is silently dropped. If you want a specific set of columns in a specific order, replace that line with a hand-written header list, as sketched below.
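For example, a minimal sketch with a fixed header (the field names here are hypothetical; use the profile labels that actually appear on the scraped player pages):

# Hypothetical fixed column order; adjust the labels to match
# the fields scraped from the profile boxes.
header = ["Name", "URL", "Born", "Height", "Weight"]
values = [header, *[[e.get(k, "") for k in header] for e in res]]
worksheet.append_rows(values, value_input_option="USER_ENTERED")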