How to use Beautiful Soup to extract string in <script> tag email?

Time:11-18

In a given .html page, I have a script tag like so:

 <script>
        atomic({
            "playlist": [{
                "id": "123456",
                "email": "[email protected]",
                "token": "92426029ccf14bca5e495a419868af30"
            }]
        }).$mount('#app');
    </script>

How can I use Beautiful Soup to extract the email address?

CodePudding user response:

You can locate the <script> tag using soup.find('script'), and then use the built-in re module to extract the email after calling .string:

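import re
from bs4 import BeautifulSoup

# `script` holds the HTML source of the page
soup = BeautifulSoup(script, "html.parser")
script_tag = soup.find("script")

# Capture the value that follows the "email" key
print(
    re.search(r'"email": "(.*?)"', script_tag.string).group(1)
)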

Output:

[email protected]
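If the page contains several <script> tags, soup.find("script") only returns the first one, and it returns None when there is none at all. A minimal sketch that scans every script tag and returns None when nothing matches. The helper name find_email and the address user@example.com are placeholders for illustration, not part of the original page:

```python
import re
from bs4 import BeautifulSoup

# Placeholder page; the address is hypothetical sample data.
html = """
<script>var unrelated = 1;</script>
<script>
    atomic({
        "playlist": [{
            "id": "123456",
            "email": "user@example.com",
            "token": "92426029ccf14bca5e495a419868af30"
        }]
    }).$mount('#app');
</script>
"""

# Tolerate optional whitespace around the colon
EMAIL_RE = re.compile(r'"email"\s*:\s*"(.*?)"')


def find_email(page):
    """Return the first "email" value found in any <script> tag, else None."""
    soup = BeautifulSoup(page, "html.parser")
    for tag in soup.find_all("script"):
        # .string is None for an empty tag, so fall back to ""
        match = EMAIL_RE.search(tag.string or "")
        if match:
            return match.group(1)
    return None


print(find_email(html))
```

Returning None instead of raising lets the caller decide how to handle a page that has no matching script.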

CodePudding user response:

You could do it this way.

  • Select the <script> tag first and then extract its text using .string. Note that get_text() may return nothing for a <script> tag, depending on the Beautiful Soup version.
  • To get the embedded JSON string, remove the whitespace (tabs, newlines, spaces) and strip off the surrounding JavaScript call.
  • Parse the JSON string into a Python dict using the json module and extract the info you need.

Here is how it is done.

import json
import re
from bs4 import BeautifulSoup

s = """
<script>
        atomic({
            "playlist": [{
                "id": "123456",
                "email": "[email protected]",
                "token": "92426029ccf14bca5e495a419868af30"
            }]
        
        }).$mount('#app');
    </script>"""

soup = BeautifulSoup(s, 'lxml')
t = soup.find('script')
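x = t.string.strip()
# Remove all whitespace (\s already covers newlines and tabs).
# Note: this also removes spaces inside string values.
x = re.sub(r"\s+", "", x)

# Cut out the JSON object between the first "{" and the last "}".
# str.lstrip()/rstrip() remove sets of characters, not exact prefixes or
# suffixes, so they are unreliable for trimming the atomic( ... ) wrapper.
x = x[x.index("{"): x.rindex("}") + 1]

data = json.loads(x)  # json.loads returns a dict here, not a string

for i, v in data['playlist'][0].items():
    print(f"{i}: {v}")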

Output:

id: 123456
email: [email protected]
token: 92426029ccf14bca5e495a419868af30
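
An alternative that avoids trimming the wrapper by hand: json.JSONDecoder.raw_decode parses one JSON value starting at a given index and reports where it ended, so anything after the closing brace is ignored and whitespace inside values is left alone. A sketch, assuming the script text has already been read from t.string and that the first "{" opens the JSON payload. The helper name extract_json is hypothetical and the address is placeholder data:

```python
import json

# Text as it might come out of t.string; the address is placeholder data.
script_text = """
        atomic({
            "playlist": [{
                "id": "123456",
                "email": "user@example.com",
                "token": "92426029ccf14bca5e495a419868af30"
            }]
        }).$mount('#app');
"""


def extract_json(text):
    """Decode the first JSON object embedded in text."""
    start = text.index("{")  # raises ValueError if there is no "{"
    # raw_decode returns (object, index just past the decoded value)
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


data = extract_json(script_text)
print(data["playlist"][0]["email"])
```

This only works when the embedded object is strict JSON (double-quoted keys, no trailing commas); a JavaScript object literal that is not valid JSON would still raise json.JSONDecodeError.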