I've written a function that is meant to check whether a phrase appears on a certain website; however, it always tells me that the phrase isn't on the website even when it is. I'm relatively new to web scraping, so any help would be appreciated.
def check_availability(url, phrase):
    global log
    try:
        # page = urllib.request.urlopen(url)
        r = requests.get(url)
        soup = BeautifulSoup(url, 'html.parser')
        if phrase in soup.text:
            return False
        return True
    except:
        log = "Error parsing website"
This always returns True for some reason. Please help.
CodePudding user response:
Modified function:
import requests
from bs4 import BeautifulSoup

def url_contains(url, phrase):
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    return phrase in soup.get_text()
Example:
>>> url = 'https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss'
>>> url_contains(url, 'Princeps mathematicorum')
True
>>> url_contains(url, 'foo bar')
False
Slightly optimized:
import requests
from bs4 import BeautifulSoup
from functools import lru_cache

@lru_cache(maxsize=4)
def get_soup(url):
    return BeautifulSoup(requests.get(url).content, 'html.parser')

def url_contains(url, phrase):
    return phrase in get_soup(url).get_text()
This caches the soup obtained from a URL, so you can repeatedly query many phrases against the same page. For the example above, the first query takes roughly a third of a second; subsequent queries against that URL take about 4 ms.
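A rough way to see the caching effect (a sketch only; the exact timings depend on your network and machine):

import time

url = 'https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss'

# First call fetches and parses the page
start = time.perf_counter()
url_contains(url, 'Princeps mathematicorum')
print(f'first query:  {time.perf_counter() - start:.3f} s')

# Second call reuses the soup cached by get_soup
start = time.perf_counter()
url_contains(url, 'foo bar')
print(f'second query: {time.perf_counter() - start:.3f} s')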