I was wondering if you can help.
I'm using BeautifulSoup and gspread to scrape content and write it to Google Sheets.
I've created a crawler that runs through a series of URLs, scrapes the content and then updates a Google sheet.
What I now want to do is prevent a URL from being written to my sheet again if it already exists in column C.
e.g. if I already had the URL https://www.bbc.co.uk/1 in my table, I wouldn't want it appearing in my table again.
Here is my code:
from cgitb import text
import requests
from bs4 import BeautifulSoup
import gspread
import datetime
import urllib.parse
gc = gspread.service_account(filename='creds.json')
sh = gc.open('scrapetosheets').sheet1
urls = ["https://www.ig.com/uk/trading-strategies", "https://www.ig.com/us/trading-strategies"]
for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html, 'html.parser')
    for item in soup.find_all('h3', class_="article-category-section-title"):
        date = datetime.datetime.now()
        title = item.find('a', class_ = 'primary js_target').text.strip()
        url = item.find('a', class_ = 'primary js_target').get('href')
        abs = "https://www.ig.com"
        rel = url
        info = {'date':date, 'title':title, 'url':urllib.parse.urljoin(abs, rel)}
        sh.append_row([str(info['date']), str(info['title']), str(info['url'])])
Thanks in advance.
Mark
I'd like to know what I can add to the end of my code to prevent duplicate URLs from being entered into my Google Sheet.
CodePudding user response:
I believe your goal is as follows.
- You want to append the values of [str(info['date']), str(info['title']), str(info['url'])] only when the value of str(info['url']) does not already exist in column "C".
Modification points:
- To do this, you need to check column "C" of the existing sheet sh = gc.open('scrapetosheets').sheet1. This has already been mentioned in TheMaster's comment.
- In your current script, append_row is called inside a loop, so each row costs one API request and the process cost becomes high. It is cheaper to collect the new rows first and append them all at once with append_rows. (A stand-alone sketch of the duplicate check follows below.)
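As a stand-alone sketch of that column "C" check (the date, title and URL below are placeholders, and this assumes the URLs really do live in the third column), you could read the column once with gspread's col_values and test membership before appending:

import gspread

gc = gspread.service_account(filename='creds.json')
sh = gc.open('scrapetosheets').sheet1

# Read every value currently in column "C" once (col_values is 1-indexed).
existing_urls = set(sh.col_values(3))

new_url = "https://www.bbc.co.uk/1"  # placeholder for a URL produced by the crawler
if new_url not in existing_urls:
    sh.append_row(["2022-01-01", "Example title", new_url])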
When these points are reflected in your script, how about the following modification?
Modified script:
from cgitb import text
import requests
from bs4 import BeautifulSoup
import gspread
import datetime
import urllib.parse
gc = gspread.service_account(filename='creds.json')
sh = gc.open('scrapetosheets').sheet1
urls = ["https://www.ig.com/uk/trading-strategies", "https://www.ig.com/us/trading-strategies"]
# I modified the below script.
# Build a lookup of the URLs that already exist in column "C" (the 3rd value of each row).
obj = {r[2]: True for r in sh.get_all_values()}
ar = []
for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html, "html.parser")
    for item in soup.find_all("h3", class_="article-category-section-title"):
        date = datetime.datetime.now()
        title = item.find("a", class_="primary js_target").text.strip()
        url = item.find("a", class_="primary js_target").get("href")
        abs = "https://www.ig.com"
        rel = url
        info = {"date": date, "title": title, "url": urllib.parse.urljoin(abs, rel)}
        url = str(info["url"])
        # Only collect rows whose URL is not already in the sheet.
        if url not in obj:
            ar.append([str(info["date"]), str(info["title"]), url])
# Append all new rows in a single request.
if ar != []:
    sh.append_rows(ar, value_input_option="USER_ENTERED")
- When this script is run, the existing values are first retrieved from the sheet and used to build a lookup object for str(info["url"]). When the value of str(info["url"]) does not already exist in column "C" of the sheet, the row's values are collected into an array. The array is then appended to the sheet with a single append_rows request.
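One small caveat, in case it matters for your sheet: the lookup object is not updated while the new rows are being collected, so if the same URL were scraped twice in a single run it would be appended twice. A minimal sketch of a variation that also guards against that (scraped_rows is a hypothetical list of [date, title, url] rows standing in for the scraping loop above):

# Variation: also skip duplicates that occur within the same run.
seen = {r[2]: True for r in sh.get_all_values()}
ar = []
for row in scraped_rows:  # hypothetical list of [date, title, url] rows from the scraper
    row_url = row[2]
    if row_url not in seen:
        ar.append(row)
        seen[row_url] = True  # remember it so a later duplicate in this run is skipped
if ar != []:
    sh.append_rows(ar, value_input_option="USER_ENTERED")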