I am writing my first Python program and hope you can help me with my current problem.
I am trying to extract data from a website. I checked the source of the page, where a certain string (let's say "thisstring") is part of a line.
In the HTML code, the string appears inside a script tag:
<script>
anotherstring;
thisstring = {...};
My current code:
import requests
from bs4 import BeautifulSoup
page = requests.get('www.somewebadress.com')
soup = BeautifulSoup(page.content, 'html.parser')
lines = soup.find_all('script')
x = 0  # counter for scripts; returns the correct number of <script> parts in the HTML code
for line in lines:
    x = x + 1
    txt = line.find('thisstring')  # didn't work with "thisstring" either
    if txt == None:
        print("not found")
    else:
        print("found")
print(x)
I tried a lot of different solutions I found on the web, but "thisstring" is never found, even though Python prints it when I call print(line). I think the fix is quite simple, but I spent the whole day trying to find the correct code.
Does anyone have an idea?
I found several code samples on Stack Overflow and in other Python tutorials for web scraping, but none of them worked. I use Spyder. Could this be a problem?
CodePudding user response:
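line.find('thisstring') returns None because BeautifulSoup's find() looks for a child tag named thisstring, not for text inside the script.
Based on your comments, you can use the re module to extract the variable: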
import re
html_text = """\
<html>
<script>
otherscript;
</script>
<script>
anotherstring;
thisstring = {"data1": 1, "data2": 2};
</script>
</html>"""
# or:
# html_text = requests.get(...).text
data = re.search(r"thisstring = (\{.*\});", html_text).group(1)
print(data)
Prints:
{"data1": 1, "data2": 2}
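If the object literal spans several lines in the real page, the pattern above would not match, because "." does not match newlines by default. Here is a minimal sketch of one way around that, assuming the assignment ends with "};" and that the page content (hypothetical here) looks like the sample below:

```python
import re

# Hypothetical page source: the object literal spans several lines.
html_text = """\
<script>
anotherstring;
thisstring = {
    "data1": 1,
    "data2": {"nested": 2}
};
</script>"""

# re.DOTALL lets "." match newlines; the lazy ".*?" stops at the first "}"
# that is followed by optional whitespace and ";".
pattern = re.compile(r"thisstring\s*=\s*(\{.*?\})\s*;", re.DOTALL)

match = pattern.search(html_text)
if match is None:
    print("not found")
else:
    data = match.group(1)
    print(data)
```

Checking for None before calling group() avoids an AttributeError when the variable is missing from the page. The lazy match can still stop too early if the text "};" occurs inside a string value, so treat this as a starting point only.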
Then you can use ast.literal_eval, json or js2py to convert the string to a Python object:
import json
data = json.loads(data)
print(data)
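Which converter works depends on how the object is written in the page. json.loads requires double-quoted strings and understands true/false/null, while ast.literal_eval accepts single-quoted strings but not true/false/null. The sketch below tries both in turn. parse_object_literal is a hypothetical helper name, not part of any library:

```python
import ast
import json


def parse_object_literal(text):
    """Hypothetical helper: try strict JSON first, then a Python literal."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Single-quoted strings are not valid JSON but are valid Python literals.
        return ast.literal_eval(text)


print(parse_object_literal('{"data1": 1, "flag": true}'))
print(parse_object_literal("{'data1': 1, 'data2': 2}"))
```

Neither function handles unquoted keys such as {data1: 1}. That case needs a JavaScript-aware parser.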