I was wondering if you can help.
I'm using BeautifulSoup and gspread to scrape content and write it to Google Sheets.
I've created a crawler that runs through a series of URLs, scrapes the content and then updates a Google sheet.
What I now want to do is prevent a URL from being written to my sheet again if it already exists in column C.
e.g. if I already had the URL https://www.bbc.co.uk/1 in my table, I wouldn't want it appearing in my table again.
Here is my code:
from cgitb import text
import requests
from bs4 import BeautifulSoup
import gspread
import datetime
import urllib.parse
gc = gspread.service_account(filename='creds.json')
sh = gc.open('scrapetosheets').sheet1
urls = ["https://www.ig.com/uk/trading-strategies", "https://www.ig.com/us/trading-strategies"]
for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html, 'html.parser')
    for item in soup.find_all('h3', class_="article-category-section-title"):
        date = datetime.datetime.now()
        title = item.find('a', class_ = 'primary js_target').text.strip()
        url = item.find('a', class_ = 'primary js_target').get('href')
        abs = "https://www.ig.com"
        rel = url
        info = {'date':date, 'title':title, 'url':urllib.parse.urljoin(abs, rel)}
        sh.append_row([str(info['date']), str(info['title']), str(info['url'])])
Thanks in advance.
Mark
I'd like to know what I can add to the end of my code to prevent duplicate URLs from being entered into my Google Sheet.
CodePudding user response:
I believe your goal is as follows.
- You want to append the values of [str(info['date']), str(info['title']), str(info['url'])] only when the value of str(info['url']) does not already exist in column "C".
Modification points:
- To do this, you need to check column "C" of the existing sheet sh = gc.open('scrapetosheets').sheet1. This has already been mentioned in TheMaster's comment.
- In your current script, append_row is called inside a loop, so each row costs one API request and the process cost becomes high. It is cheaper to collect the new rows first and append them all at once with append_rows. (A stand-alone sketch of the duplicate check follows below.)
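As a stand-alone sketch of that column "C" check (the date, title and URL below are placeholders, and this assumes the URLs really do live in the third column), you could read the column once with gspread's col_values and test membership before appending:

import gspread

gc = gspread.service_account(filename='creds.json')
sh = gc.open('scrapetosheets').sheet1

# Read every value currently in column "C" once (col_values is 1-indexed).
existing_urls = set(sh.col_values(3))

new_url = "https://www.bbc.co.uk/1"  # placeholder for a URL produced by the crawler
if new_url not in existing_urls:
    sh.append_row(["2022-01-01", "Example title", new_url])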
When these points are reflected in your script, how about the following modification?
Modified script:
from cgitb import text
import requests
from bs4 import BeautifulSoup
import gspread
import datetime
import urllib.parse
gc = gspread.service_account(filename='creds.json')
sh = gc.open('scrapetosheets').sheet1
urls = ["https://www.ig.com/uk/trading-strategies", "https://www.ig.com/us/trading-strategies"]
# I modified the below script.
# Build a lookup of the URLs that already exist in column "C" (the 3rd value of each row).
obj = {r[2]: True for r in sh.get_all_values()}
ar = []
for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html, "html.parser")
    for item in soup.find_all("h3", class_="article-category-section-title"):
        date = datetime.datetime.now()
        title = item.find("a", class_="primary js_target").text.strip()
        url = item.find("a", class_="primary js_target").get("href")
        abs = "https://www.ig.com"
        rel = url
        info = {"date": date, "title": title, "url": urllib.parse.urljoin(abs, rel)}
        url = str(info["url"])
        # Only collect rows whose URL is not already in the sheet.
        if url not in obj:
            ar.append([str(info["date"]), str(info["title"]), url])
# Append all new rows in a single request.
if ar != []:
    sh.append_rows(ar, value_input_option="USER_ENTERED")
- When this script is run, the existing values are first retrieved from the sheet and used to build a lookup object for str(info["url"]). When the value of str(info["url"]) does not already exist in column "C" of the sheet, the row's values are collected into an array. The array is then appended to the sheet with a single append_rows request.
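One small caveat, in case it matters for your sheet: the lookup object is not updated while the new rows are being collected, so if the same URL were scraped twice in a single run it would be appended twice. A minimal sketch of a variation that also guards against that (scraped_rows is a hypothetical list of [date, title, url] rows standing in for the scraping loop above):

# Variation: also skip duplicates that occur within the same run.
seen = {r[2]: True for r in sh.get_all_values()}
ar = []
for row in scraped_rows:  # hypothetical list of [date, title, url] rows from the scraper
    row_url = row[2]
    if row_url not in seen:
        ar.append(row)
        seen[row_url] = True  # remember it so a later duplicate in this run is skipped
if ar != []:
    sh.append_rows(ar, value_input_option="USER_ENTERED")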