Why is my webscraper not detecting any changes?

I wanted to code a web scraper with beautifulsoup4 and requests. It scrapes the data of specific columns of a specific table on a specific page. It scrapes the table once, waits a certain amount of time, scrapes it again and then compares both "scrapes". If there is a difference, it prints "something has changed", and if there isn't, it prints "no changes".

Here is the entire code:

import requests
import time
from bs4 import BeautifulSoup

URL = "https://website.com"
website = requests.get(URL)
soup = BeautifulSoup(website.content, "html.parser")


data = []
table = soup.find("table", class_="table table-bordered table-sm table-responsive")
table_body = table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')[0]
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values

    cols2 = row.find_all('td')[1]
    cols2 = [ele.text.strip() for ele in cols2]
    data.append([ele for ele in cols2 if ele])  # Get rid of empty values

    cols3 = row.find_all('td')[2]
    cols3 = [ele.text.strip() for ele in cols3]
    data.append([ele for ele in cols3 if ele])  # Get rid of empty values

    cols4 = row.find_all('td')[3]
    cols4 = [ele.text.strip() for ele in cols4]
    data.append([ele for ele in cols4 if ele])

    cols5 = row.find_all('td')[5]
    cols5 = [ele.text.strip() for ele in cols5]
    data.append([ele for ele in cols5 if ele])


    print(cols, cols2, cols3, cols4, cols5)

time.sleep(600)

for row in rows:
    cols11 = row.find_all('td')[0]
    cols11 = [ele.text.strip() for ele in cols11]
    data.append([ele for ele in cols11 if ele])  # Get rid of empty values

    cols22 = row.find_all('td')[1]
    cols22 = [ele.text.strip() for ele in cols22]
    data.append([ele for ele in cols22 if ele])  # Get rid of empty values

    cols33 = row.find_all('td')[2]
    cols33 = [ele.text.strip() for ele in cols33]
    data.append([ele for ele in cols33 if ele])  # Get rid of empty values

    cols44 = row.find_all('td')[3]
    cols44 = [ele.text.strip() for ele in cols44]
    data.append([ele for ele in cols44 if ele])

    cols55 = row.find_all('td')[5]
    cols55 = [ele.text.strip() for ele in cols55]
    data.append([ele for ele in cols55 if ele])


    print(cols11, cols22, cols33, cols44, cols55)


if(cols == cols11, cols2 == cols22, cols5 == cols55):
    print("no changes")
else:
    print("something has changed")

The problem is that it always says "no changes", even though I know that something has changed. How can I fix this?

CodePudding user response：

While lists can be compared in this way, it's not clear how you reached the conclusion that you can use a comma , in place of the logical AND operator (spelled and in Python, not &&) in your if condition.

What you're doing here by wrapping your conditions in parentheses () and joining them with commas , (inadvertently, it would seem) is creating a tuple, and every non-empty tuple evaluates to True. Thus, your script always takes the branch that you intended to run only when nothing has changed between your data structures.
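To make the tuple behaviour concrete, here is a minimal sketch. The lists a and b are made-up stand-ins for two scrapes that differ; they are not taken from the question's code.

```python
# Two hypothetical scrape results that are clearly different.
a = [1, 2]
b = [3, 4]

# Commas inside parentheses build a tuple of the individual comparison results.
condition = (a == b, a == b)
print(condition)        # (False, False)

# A tuple is truthy whenever it is non-empty, whatever it contains.
print(bool(condition))  # True

# Joining the same comparisons with "and" gives the intended result.
print(a == b and a == b)  # False
```

So an if statement given the tuple always runs its first branch, even when every single comparison inside it is False.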

Instead, use the logical and operator properly (and don't pack the truth values themselves into a tuple), as you seem to intend:

if cols == cols11 and cols2 == cols22 and cols5 == cols55:
    print("no changes")
else:
    print("something has changed")

Tangential to the core of your question, but your code would benefit from (a) naming your variables in a much more descriptive manner, and (b) using datatypes that better fit your use case as opposed to introducing a brand-new numbered variable for every index and unnecessarily duplicating code.
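As a rough illustration of points (a) and (b), the sketch below keeps each scrape in a single list of rows instead of numbered variables. It assumes every row has already been reduced to a list of cell strings; the names TRACKED_COLUMNS and select_tracked_cells, as well as the sample data, are hypothetical and only serve the example.

```python
# Indexes of the table columns worth tracking (the same ones as in the question).
TRACKED_COLUMNS = (0, 1, 2, 3, 5)


def select_tracked_cells(table_rows, tracked_columns=TRACKED_COLUMNS):
    """Return one list of stripped cell texts per row, keeping only tracked columns."""
    return [
        [row[index].strip() for index in tracked_columns]
        for row in table_rows
    ]


# Hypothetical cell texts, standing in for two scrapes of the same table.
first_scrape = [["A ", "1", "x", "open", "ignored", "10:00"]]
second_scrape = [["A", "1", "x", "closed", "ignored", "10:00"]]

previous = select_tracked_cells(first_scrape)
current = select_tracked_cells(second_scrape)

# Comparing the two nested lists covers every row and every tracked column at once.
if previous == current:
    print("no changes")
else:
    print("something has changed")
```

With one snapshot per scrape, a single == comparison replaces the chain of per-column comparisons, and it covers all rows rather than only the values left over from the last loop iteration.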

CodePudding user response：

In addition to what others have said, you must make another GET request to the URL after pausing in order to detect any changes in the data of the webpage.

What you are doing is:

Making a GET request to the URL.
Creating a soup object from the response.
Extracting the data from the soup and storing it in variables.
Pausing for a while with time.sleep(600).
Extracting the same information again from the same soup, which will always be equal, without making a new GET request.

So you need to add this code right after the time.sleep(600) statement to fetch the modified data (if any) from the webpage:

URL = "https://website.com"
website = requests.get(URL)
soup = BeautifulSoup(website.content, "html.parser")

table = soup.find("table", class_="table table-bordered table-sm table-responsive")
table_body = table.find('tbody')

rows = table_body.find_all('tr')
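Putting the pieces together, the whole flow could be sketched as follows. This is only one possible arrangement, not the asker's or the answerers' code: scrape_columns and page_changed are hypothetical names, the 30-second timeout is an assumption, and the table lookup simply reuses the selector from the question without handling a missing table or short rows.

```python
import time


def scrape_columns(url, column_indexes=(0, 1, 2, 3, 5)):
    """Download the page and return the selected cell texts of every table row."""
    # Third-party imports live inside the function so the comparison logic
    # below can be tried without requests or beautifulsoup4 installed.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=30)  # timeout value is an assumption
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table", class_="table table-bordered table-sm table-responsive")
    rows = table.find("tbody").find_all("tr")

    snapshot = []
    for row in rows:
        cells = row.find_all("td")
        snapshot.append([cells[index].text.strip() for index in column_indexes])
    return snapshot


def page_changed(take_snapshot, wait_seconds=600, sleep=time.sleep):
    """Take a snapshot, wait, take a fresh snapshot, and report whether they differ."""
    before = take_snapshot()  # first GET request and parse
    sleep(wait_seconds)
    after = take_snapshot()   # second, fresh GET request and parse
    return before != after
```

A call such as page_changed(lambda: scrape_columns("https://website.com")) would then download the page twice, ten minutes apart, and return True only if the tracked columns differ. Passing the snapshot function in as an argument is a design choice that keeps the waiting and comparing logic separate from the network access.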