In a given .html page, I have a script tag like so:
<script>
atomic({
"playlist": [{
"id": "123456",
"email": "[email protected]",
"token": "92426029ccf14bca5e495a419868af30"
}]
}).$mount('#app');
</script>
How can I use Beautiful Soup to extract the email address?
CodePudding user response:
You can locate the <script> tag using soup.find('script'), and then use the built-in re module to extract the email after calling .string:
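import re
from bs4 import BeautifulSoup

# `script` is assumed to hold the HTML source shown in the question
soup = BeautifulSoup(script, "html.parser")
script_tag = soup.find("script")
print(
    re.search(r'"email": "(.*?)"', script_tag.string).group(1)
)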
Output:
[email protected]
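If the page has more than one <script> tag, soup.find("script") only returns the first one, and .string is None for tags with no inline content. Here is a rough sketch of one way to handle that. The find_email helper, the EMAIL_RE pattern, and the user@example.com placeholder address are made up for illustration and are not part of the original page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical pattern: tolerates optional whitespace around the colon.
EMAIL_RE = re.compile(r'"email"\s*:\s*"(.*?)"')


def find_email(html):
    """Return the first "email" value found in any <script> tag, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script"):
        # .string is None for empty tags such as <script src="..."></script>
        if tag.string is None:
            continue
        match = EMAIL_RE.search(tag.string)
        if match:
            return match.group(1)
    return None


# Placeholder page with an external script followed by an inline one.
page = """
<script src="app.js"></script>
<script>
atomic({"playlist": [{"id": "123456", "email": "user@example.com"}]}).$mount('#app');
</script>
"""
print(find_email(page))
```

Returning None instead of calling .group(1) directly avoids an AttributeError when no script contains an "email" key.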
CodePudding user response:
You could do it this way.
- Select the <script> tag first and then extract its text using .string. Note that get_text() may return an empty string for <script> in some Beautiful Soup versions.
- To get the inner JSON string, do some string manipulation: remove tabs, newlines and spaces, then strip off the unwanted wrapper text.
- Convert the JSON string to a Python object using the json module and extract the info you need.
Here is how it is done.
import json
import re
from bs4 import BeautifulSoup
s = """
<script>
atomic({
"playlist": [{
"id": "123456",
"email": "[email protected]",
"token": "92426029ccf14bca5e495a419868af30"
}]
}).$mount('#app');
</script>"""
soup = BeautifulSoup(s, 'lxml')
t = soup.find('script')
x = t.string.strip()
x = re.sub(r"\s+", "", x)  # Remove newlines, spaces and tabs (this also removes whitespace inside values)
# Strip the JavaScript wrapper to get the inner JSON string.
# removeprefix/removesuffix (Python 3.9+) remove exact text,
# whereas lstrip/rstrip strip any characters from the given set.
x = x.removeprefix('atomic(')
x = x.removesuffix(").$mount('#app');")
data = json.loads(x)
for i, v in data['playlist'][0].items():
    print(f"{i}: {v}")
Output:
id: 123456
email: [email protected]
token: 92426029ccf14bca5e495a419868af30
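If you would rather not hard-code the atomic( and ).$mount('#app'); wrapper text, another option is to let the json module find where the object ends. This is only a sketch: parse_first_json_object is a made-up helper name, the email is a placeholder, and it assumes the first "{" in the script starts the JSON object you want. The script_text string stands in for what t.string returns.

```python
import json


def parse_first_json_object(script_text):
    """Decode the first JSON object embedded in a string of JavaScript."""
    start = script_text.index("{")  # raises ValueError if there is no "{"
    # raw_decode parses one JSON value beginning at `start` and ignores
    # whatever follows it, so the trailing ).$mount('#app'); is left alone.
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj


# Stand-in for the contents of the <script> tag (placeholder email).
script_text = """
atomic({
"playlist": [{
"id": "123456",
"email": "user@example.com",
"token": "92426029ccf14bca5e495a419868af30"
}]
}).$mount('#app');
"""
data = parse_first_json_object(script_text)
print(data["playlist"][0]["email"])
```

Because nothing is stripped from the text, whitespace inside string values is preserved, which matters if a value could contain spaces.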