Scraping a game name containing "@": scraper recognizes the name as an email address


I want to scrape information about games. However, some game names contain "@", such as the game "Ampers@t".

When I try to scrape such a game's title, the code returns "[email protected]". Apparently, my code does not recognize that this is a game's name and not an email address.

Here is the code I use:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup

# session that retries failed connections (3 attempts, exponential backoff)
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount("http://", adapter)
session.mount("https://", adapter)

def grab_soup(url):
    """Takes a url and returns a BeautifulSoup object"""
    response = session.get(url, headers={"User-Agent": "Mozilla/5.0"})
    assert response.status_code == 200, "Problem with url request! %s throws %s" % (
        url,
        response.status_code,
    )  # checking that it worked
    page = response.text
    soup = BeautifulSoup(page, "lxml")
    return soup

soup = grab_soup("https://www.mobygames.com/game/amperst")

header = soup.find(class_="niceHeaderTitle")("a")  # calling the tag is shorthand for .find_all("a")

The output I expect for header is:

<a href="https://www.mobygames.com/game/amperst">Ampers@t</a>,
...

However, the output is:

<a href="https://www.mobygames.com/game/amperst"><span  data-cfemail="ce8fa3beabbcbd8eba">[email protected]</span></a>,
...

I checked the page source for the game, and indeed it is recorded like this:

<div >
<h1 >
<a href="https://www.mobygames.com/game/amperst">
<span  data-cfemail="cd8ca0bda8bfbe8db9">[email&#160;protected]
</span>
</a>"

So the reason my code gives me this content is probably that this is what is recorded in the page source: Cloudflare's email-obfuscation feature has encoded the name containing "@" as if it were an email address. But is there a way to get rid of this issue?

I found this solution, but it is undesirable for me, as I have lots of games to scrape and cannot run that code for each game by hand.

CodePudding user response:

[Expanded from my comment] If you tweak the function from this answer a bit to

def deCFEmail(encTag):
    if not (encTag.get('data-cfemail') or encTag.select('*[data-cfemail]')):
        encTag.append('[! no "data-cfemail" attribute !]')
    else:
        # the attribute can sit on the tag itself or on a descendant
        fp = encTag.get('data-cfemail', None)
        if fp is None:
            fp = encTag.select_one('*[data-cfemail]').get('data-cfemail')
        try:
            r = int(fp[:2], 16)  # the first hex byte is the XOR key
            encTag.string = ''.join(chr(int(fp[i:i + 2], 16) ^ r) for i in range(2, len(fp), 2))
        except Exception as e:
            encTag.append(f'! failed to decode "{e}"')
    return encTag

then you can use it conditionally:

for i, h in enumerate(header):
    if h.get_text() == '[email\xa0protected]':
        header[i] = deCFEmail(h)
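
(Note the \xa0 in the comparison string: the page source encodes the space in [email&#160;protected] as a non-breaking space, which a plain space would not match.)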

So, header can go from

[<a href="https://www.mobygames.com/game/amperst"><span  data-cfemail="8ecfe3feebfcfdcefa">[email protected]</span></a>,
 <a  href="https://www.mobygames.com/game/amperst/forums">Discuss</a>,
 <a  href="https://www.mobygames.com/game/sheet/review_game/amperst/">Review</a>,
 <a  href="https://www.mobygames.com/game/amperst/add-to-want-list">  Want</a>,
 <a  href="https://www.mobygames.com/game/amperst/add-to-have-list">  Have</a>,
 <a  href="https://www.mobygames.com/game/amperst/contribute">Contribute</a>]

to

[<a href="https://www.mobygames.com/game/amperst">Ampers@t</a>,
 <a  href="https://www.mobygames.com/game/amperst/forums">Discuss</a>,
 <a  href="https://www.mobygames.com/game/sheet/review_game/amperst/">Review</a>,
 <a  href="https://www.mobygames.com/game/amperst/add-to-want-list">  Want</a>,
 <a  href="https://www.mobygames.com/game/amperst/add-to-have-list">  Have</a>,
 <a  href="https://www.mobygames.com/game/amperst/contribute">Contribute</a>]
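
For what it's worth, the encoding that deCFEmail undoes is simple: the first hex byte of data-cfemail is an XOR key, and each subsequent byte is one character of the original text XOR-ed with that key. A minimal standalone sketch (decode_cfemail is just an illustrative name), fed the value from the header above:

def decode_cfemail(cf):
    """Decode a Cloudflare 'data-cfemail' hex string: byte 0 is the XOR
    key; every following byte is one character XOR-ed with that key."""
    key = int(cf[:2], 16)
    return ''.join(chr(int(cf[i:i + 2], 16) ^ key) for i in range(2, len(cf), 2))

print(decode_cfemail("8ecfe3feebfcfdcefa"))  # -> Ampers@t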

As for "it is kind of a speed requirement": the additional time would be comparable to adding one extra find call, and in any case it would be insignificant next to the parsing time (for soup = BeautifulSoup(page, "lxml")).
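
If you want to verify that on your own machine, a rough micro-benchmark along these lines (using the illustrative decode_cfemail helper sketched above; absolute numbers will vary) should show the decode cost disappearing next to the parse cost:

import timeit

from bs4 import BeautifulSoup

# stand-in page with many encoded spans, just to give the parser real work
html = '<span data-cfemail="8ecfe3feebfcfdcefa">x</span>' * 1000

parse_t = timeit.timeit(lambda: BeautifulSoup(html, "lxml"), number=10)
decode_t = timeit.timeit(lambda: decode_cfemail("8ecfe3feebfcfdcefa"), number=10)

print(f"10 parses: {parse_t:.4f}s / 10 decodes: {decode_t:.6f}s")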
