I found a JSON response behind a webpage and scraped it with Selenium using the code below:
from selenium import webdriver
driver = webdriver.Chrome()  # assumption: Chrome; any WebDriver works the same way
url = "website.json"
driver.get(url)
text = driver.page_source
with open("data.json", "w", encoding="utf-8") as html_file:
    html_file.write(text)
But when I open the file it is like this:
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
"Status": "OK",
"TotalRows": 386,
"Items": [
...
]
}</pre></body></html>
So the JSON ends up wrapped in HTML tags. To solve this problem I tried this code:
t1 = text.replace('<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">', "")
t2 = t1.replace('</pre></body></html>', "")
with open('data.json', 'w') as outfile:
    json.dump(t2, outfile, indent=2)
But when I run this, data.json contains a single escaped string like this:
"{\n \"Status\": \"OK\",\n \"TotalRows\": 401,\n \"Items\": [\n ...\n ]\n}"
What should I do?
CodePudding user response:
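You should find only the element you want, extract its text, and parse that text as JSON before dumping it:
import json
from selenium.webdriver.common.by import By
driver.get(url)
element = driver.find_element(By.TAG_NAME, 'pre')
data = json.loads(element.text)  # parse the text so it is dumped as an object, not a string
with open('data.json', 'w', encoding='utf-8') as file:
    json.dump(data, file, indent=2)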
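For context on why the question's output came out quoted and escaped: json.dump serializes whatever Python object it is given, and a str becomes a JSON string literal. Here is a minimal sketch without Selenium, where a hard-coded sample string (made up for illustration) stands in for the scraped text:

```python
import json

# Hypothetical sample standing in for the text scraped from the <pre> element.
raw = '{\n "Status": "OK",\n "TotalRows": 386,\n "Items": []\n}'

# Dumping the str directly yields a JSON string literal (quoted, with escapes).
as_string = json.dumps(raw)

# Parsing first gives a dict, which then dumps as a normal JSON object.
parsed = json.loads(raw)
as_object = json.dumps(parsed, indent=2)

print(as_string)
print(as_object)
```

The first print shows the same kind of escaped one-line string the question ended up with; the second shows a regular indented JSON object.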
CodePudding user response:
It seems that after stripping the HTML tags with:
t1 = text.replace('<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">', "")
t2 = t1.replace('</pre></body></html>', "")
I need to parse t2 as JSON first and then dump the parsed result, as below:
import json
res = json.loads(t2)
with open('data.json', 'w') as outfile:
    json.dump(res, outfile, indent=4)
This worked for me.
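If you would rather not hard-code the exact opening tag, a regular expression can pull out whatever sits inside the pre element. This is only a sketch, assuming the page source has the shape shown in the question. extract_json_from_pre is a hypothetical helper name and the sample string is made up. The html.unescape call is there because page_source is serialized HTML, so a character like & inside the JSON comes back as an entity such as &amp;.

```python
import html
import json
import re


def extract_json_from_pre(page_source: str):
    """Parse the JSON found inside the first <pre> element (hypothetical helper)."""
    match = re.search(r"<pre[^>]*>(.*?)</pre>", page_source, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <pre> element found in page source")
    # Undo HTML entity escaping (&amp;, &lt;, &gt;) before parsing as JSON.
    return json.loads(html.unescape(match.group(1)))


# Made-up page source in the same shape as the one in the question.
sample = (
    '<html><head></head><body>'
    '<pre style="word-wrap: break-word; white-space: pre-wrap;">'
    '{"Status": "OK", "Items": ["a &amp; b"]}'
    '</pre></body></html>'
)
data = extract_json_from_pre(sample)
print(data)
```

The returned object can then be written out with json.dump(data, outfile, indent=4) as above.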