How to decode the string encoded by document.write in BeautifulSoup, Python?

Time:10-03

As the title says, I've been stuck on this for hours and can't find any documentation or solution.
I started from this website: https://idhsaa.org/directory. I can't extract the email addresses, either on this page or on the individual school pages that open when you click a school name.

The format that I found is something like this:

<p>
    <script>
        document.write(window.atob('PGEgaHJlZj0nbWFpbHRvOnNwZWNrZXI3M0B5YWhvby5jb20nPkVtYWlsPC9hPg=='));
    </script>
    <a href="mailto:[email protected]">Email</a>
    </br>
</p>

I managed to get the encoded code that looks something like this:

mailto:<script>document.write(window.atob('PGEgaHJlZj0nbWFpbHRvOmFkbWluQGlkaHNhYS5vcmcnPmFkbWluQGlkaHNhYS5vcmc8L2E+'));</script>

The question is: how do I decode this to get the email addresses?
Based on the output above, I assume I need to decode that string to get the actual email.

Here's the code that I'd been working on:

import requests
from bs4 import BeautifulSoup


def url_parser(url):
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246',
    }
    html_doc = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html_doc, 'html.parser')
    return soup


def data_fetch(url):
    soup = url_parser(url)
    table = soup.find('table').find('tbody')
    rows = table.find_all('tr')

    data = []
    for row in rows:
        school_name = row.find_all('a')

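        for school in school_name:
            # Assumption: the link's href (e.g. "school?...") is the school page's relative path.
            school_web_id = (school.get('href') or '').lstrip('/')
            if 'school?' in school_web_id:
                school_website = url.replace('/directory', f'/{school_web_id}')

                school_site = url_parser(school_website)
                principal_email_encoded = school_site.find_all('a')
                for principal_email in principal_email_encoded:
                    # Links without an href return None, so fall back to an empty string.
                    email = principal_email.get('href') or ''
                    if 'mailto:<script>' in email:
                        print(email.replace('mailto:<script>', '').replace(';</script>', ''))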



def main():
    url = "https://idhsaa.org/directory"
    data_fetch(url)


if __name__ == "__main__":
    main()

CodePudding user response:

These are base64-encoded strings. You can decode them with the base64 module from the Python standard library.

For example, after extracting the encoded string you can do the following:

import base64
encoded_str = "PGEgaHJlZj0nbWFpbHRvOmFkbWluQGlkaHNhYS5vcmcnPmFkbWluQGlkaHNhYS5vcmc8L2E+"
decoded_html = base64.b64decode(encoded_str).decode("utf-8")
print(decoded_html)

Output:

<a href='mailto:admin@idhsaa.org'>admin@idhsaa.org</a>

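To get from the scraped text to the encoded string in the first place, a regular expression can pull the argument out of the window.atob(...) call. The following is only a minimal sketch: the helper name decode_atob_calls and the pattern are my own assumptions, and it assumes the payload is quoted as in the question. It also assumes that a space inside the payload is a "+" that got mangled somewhere along the way, since base64 itself never contains spaces.

```python
import base64
import re

# Matches window.atob('...') or window.atob("...") and captures the payload.
ATOB_RE = re.compile(r"""window\.atob\(\s*(['"])(.*?)\1\s*\)""")


def decode_atob_calls(text):
    """Return the decoded HTML for every window.atob(...) call found in text."""
    results = []
    for match in ATOB_RE.finditer(text):
        # Assumption: a space in the payload was originally a "+".
        payload = match.group(2).replace(" ", "+")
        # Re-add any "=" padding that may have been stripped.
        payload += "=" * (-len(payload) % 4)
        results.append(base64.b64decode(payload).decode("utf-8"))
    return results


snippet = (
    "<script>document.write(window.atob("
    "'PGEgaHJlZj0nbWFpbHRvOmFkbWluQGlkaHNhYS5vcmcnPmFkbWluQGlkaHNhYS5vcmc8L2E+'"
    "));</script>"
)
print(decode_atob_calls(snippet))
```

Because requests does not execute JavaScript, the script text is still present in the raw HTML, so this can be run on the href value or on the whole page source.
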
CodePudding user response:

window.atob decodes a base64 encoded string.

You can use the base64 module to decode it like this:

import base64
encoded = "PGEgaHJlZj0nbWFpbHRvOmFkbWluQGlkaHNhYS5vcmcnPmFkbWluQGlkaHNhYS5vcmc8L2E+"
decoded = base64.b64decode(encoded)

You can then use bs4 to extract the email like this:

from bs4 import BeautifulSoup
soup = BeautifulSoup(decoded, features="html.parser")
soup.find("a")["href"]  # mailto:admin@idhsaa.org
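
If you only want the bare address, the "mailto:" prefix still has to be removed from the href. One possible way, sketched here with the standard library's urllib.parse (the helper name email_from_mailto is hypothetical), also drops any "?subject=..." part that a mailto link may carry:

```python
from urllib.parse import unquote, urlsplit


def email_from_mailto(href):
    """Return the address from a mailto: link, or None for any other link."""
    parts = urlsplit(href)
    if parts.scheme != "mailto":
        return None
    # The address is the path component; the query (if any) is split off.
    return unquote(parts.path)


print(email_from_mailto("mailto:admin@idhsaa.org"))
```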