I want to scrape information about games. However, some game names contain "@", such as the game "Ampers@t".
When I try to scrape the title of such a game, my code returns "[email protected]". Apparently, my code does not recognize that this is the game's name and not an email address.
Here is the code I used:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount("http://", adapter)
session.mount("https://", adapter)
def grab_soup(url):
    """Takes a url and returns a BeautifulSoup object"""
    response = session.get(url, headers={"User-Agent": "Mozilla/5.0"})
    assert response.status_code == 200, "Problem with url request! %s throws %s" % (
        url,
        response.status_code,
    )  # checking that it worked
    page = response.text
    soup = BeautifulSoup(page, "lxml")
    return soup
soup = grab_soup("https://www.mobygames.com/game/amperst")
header = soup.find(class_="niceHeaderTitle")("a")
I expect header to be:
<a href="https://www.mobygames.com/game/amperst">Ampers@t</a>,
...
However, the output is:
<a href="https://www.mobygames.com/game/amperst"><span data-cfemail="ce8fa3beabbcbd8eba">[email protected]</span></a>,
...
I then checked the page source for the game, and it does indeed contain this:
<div >
<h1 >
<a href="https://www.mobygames.com/game/amperst">
<span data-cfemail="cd8ca0bda8bfbe8db9">[email protected]
</span>
</a>
So my code probably returns this content because that is what the page source contains. But is there a way to get around this issue?
I found this solution, but it is undesirable for me because I have lots of games to scrape and cannot run the code separately for each game.
CodePudding user response:
[Expanded from my comment] If you tweak the function from this answer a bit to
def deCFEmail(encTag):
    # Nothing to decode if neither the tag nor any descendant has the attribute
    if not (encTag.get('data-cfemail') or encTag.select('*[data-cfemail]')):
        encTag.append(f'[! no "data-cfemail" attribute !]')
    else:
        fp = encTag.get('data-cfemail', None)
        if fp is None:
            # The attribute sits on a descendant (e.g. a <span> inside the <a>)
            fp = encTag.select_one('*[data-cfemail]').get('data-cfemail')
        try:
            # The first hex byte is the XOR key for all remaining bytes
            r = int(fp[:2], 16)
            encTag.string = ''.join([chr(int(fp[i:i + 2], 16) ^ r) for i in range(2, len(fp), 2)])
        except Exception as e:
            encTag.append(f'! failed to decode "{e}"')
    return encTag
then you can use it conditionally:
for i, h in enumerate(header):
    if h.get_text()=='[email\xa0protected]':
        header[i] = deCFEmail(h)
So, header can go from
[<a href="https://www.mobygames.com/game/amperst"><span data-cfemail="8ecfe3feebfcfdcefa">[email protected]</span></a>,
<a href="https://www.mobygames.com/game/amperst/forums">Discuss</a>,
<a href="https://www.mobygames.com/game/sheet/review_game/amperst/">Review</a>,
<a href="https://www.mobygames.com/game/amperst/add-to-want-list"> Want</a>,
<a href="https://www.mobygames.com/game/amperst/add-to-have-list"> Have</a>,
<a href="https://www.mobygames.com/game/amperst/contribute">Contribute</a>]
to
[<a href="https://www.mobygames.com/game/amperst">Ampers@t</a>,
<a href="https://www.mobygames.com/game/amperst/forums">Discuss</a>,
<a href="https://www.mobygames.com/game/sheet/review_game/amperst/">Review</a>,
<a href="https://www.mobygames.com/game/amperst/add-to-want-list"> Want</a>,
<a href="https://www.mobygames.com/game/amperst/add-to-have-list"> Have</a>,
<a href="https://www.mobygames.com/game/amperst/contribute">Contribute</a>]
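To make the decoding step easier to follow on its own, here is a minimal sketch that works on the raw data-cfemail string without any BeautifulSoup objects. The name decode_cfemail is hypothetical, and the sketch assumes the decoded bytes are UTF-8 text (for plain ASCII titles this gives the same result as the chr-based loop above).

```python
def decode_cfemail(hex_string: str) -> str:
    """Decode a data-cfemail value (hypothetical helper name).

    The first byte is the XOR key; every following byte is XORed with it.
    Assumption: the decoded bytes are UTF-8 text.
    """
    data = bytes.fromhex(hex_string)
    key = data[0]
    return bytes(b ^ key for b in data[1:]).decode("utf-8", errors="replace")


# Attribute value taken from the header output shown above
print(decode_cfemail("8ecfe3feebfcfdcefa"))  # Ampers@t
```

The key changes on every page load (the question shows ce... and cd... for the same title), which is why the key has to be read from the string itself each time.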
Regarding the comment that "it is kind of a speed requirement": the additional time would be comparable to adding an extra find statement and, in any case, insignificant compared to the parsing time (for soup = BeautifulSoup(page, "lxml")).
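If checking tags one by one for every game is a concern, one possible alternative is to undo the obfuscation on the raw HTML before it is parsed. The sketch below is only an illustration: strip_cf_obfuscation and CFEMAIL_SPAN are hypothetical names, the regular expression assumes the span markup looks like the snippet in the question, and regex-based HTML rewriting is inherently fragile.

```python
import html
import re

# Assumption: the obfuscated text is always wrapped in a <span> that carries
# a data-cfemail attribute with an even number of hex digits.
CFEMAIL_SPAN = re.compile(
    r'<span\b[^>]*\bdata-cfemail="((?:[0-9a-fA-F]{2})+)"[^>]*>.*?</span>',
    re.DOTALL,
)


def _decode(match: re.Match) -> str:
    """Replace one matched span with its decoded, HTML-escaped text."""
    data = bytes.fromhex(match.group(1))
    key = data[0]  # first byte is the XOR key
    text = bytes(b ^ key for b in data[1:]).decode("utf-8", errors="replace")
    return html.escape(text)


def strip_cf_obfuscation(page: str) -> str:
    """Return the HTML with every matching span replaced by its decoded text."""
    return CFEMAIL_SPAN.sub(_decode, page)


sample = (
    '<a href="https://www.mobygames.com/game/amperst">'
    '<span data-cfemail="8ecfe3feebfcfdcefa">[email\xa0protected]</span></a>'
)
print(strip_cf_obfuscation(sample))
```

Inside grab_soup this would amount to passing strip_cf_obfuscation(response.text) to BeautifulSoup instead of response.text, so every game page is handled the same way with no per-title check.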