I would like to load some data from an HTML file into a pandas DataFrame, but I'm getting an error. My data has the following structure. It is the data between the square brackets after "lots" that I would like to put into a DataFrame, but I'm pretty confused as to what type of object this is.
html_doc = """<html><head><script>
"unrequired_data = [{"ID":XXX, "Name":XXX, "Price":100GBP, "description": null },
{"ID":XXX, "Name":XXX, "Price":150GBP, "description": null },
{"ID":XXX, "Name":XXX, "Price":150GBP, "description": null }]
"lots":[{"ID":123, "Name":ABC, "Price":100, "description": null },
{"ID":456, "Name":DEF, "Price":150, "description": null },
{"ID":789, "Name":GHI, "Price":150, "description": null }]
</script></head></html>"""
I have tried the following code
from bs4 import BeautifulSoup
import pandas as pd
soup = BeautifulSoup(html_doc)
df = pd.DataFrame("lots")
The output I would like to get would be in this format.
CodePudding user response:
Your data is not valid JSON (values such as ABC are not quoted), so you need to fix it before parsing.
I would use:
from bs4 import BeautifulSoup
import pandas as pd
import json, re
soup = BeautifulSoup(html_doc, "html.parser")
# extract the script contents
script = soup.find("script").text.strip()
# grab the bracketed list that follows the "lots" key
match = re.search(r'"lots"\s*:\s*(\[.*?\])', script, flags=re.DOTALL)
data = match.group(1) if match else None
# fix the JSON by quoting every bare value (123, ABC, null)
if data:
    data = re.sub(r':\s*([^",}]+?)\s*(?=[,}])', r':"\1"', data)
    df = pd.DataFrame(json.loads(data))
    print(df)
Output:
    ID Name Price description
0  123  ABC   100        null
1  456  DEF   150        null
2  789  GHI   150        null
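Note that the substitution above turns every value into a string, including the IDs, the prices and null. If the numeric types matter, one variation is to quote only the bare words and let the JSON parser handle numbers and null itself. The following is a minimal sketch using only the standard library. The helper name `parse_lots` and the inlined `script_text` sample are assumptions made for illustration, and it assumes no string value in the data contains a colon followed by a bare word.

```python
import json
import re

# Sample script text mirroring the question's data (inlined so the sketch runs on its own)
script_text = '''
"lots":[{"ID":123, "Name":ABC, "Price":100, "description": null },
{"ID":456, "Name":DEF, "Price":150, "description": null },
{"ID":789, "Name":GHI, "Price":150, "description": null }]
'''

def parse_lots(text):
    """Hypothetical helper: return the "lots" records as a list of dicts."""
    # Capture the bracketed list that follows the "lots" key
    match = re.search(r'"lots"\s*:\s*(\[.*?\])', text, flags=re.DOTALL)
    if match is None:
        return []
    raw = match.group(1)
    # Quote bare words such as ABC, but leave numbers and JSON literals alone
    fixed = re.sub(
        r':\s*(?!(?:null|true|false)\b)([A-Za-z_]\w*)',
        r': "\1"',
        raw,
    )
    return json.loads(fixed)

records = parse_lots(script_text)
print(records[0])
# {'ID': 123, 'Name': 'ABC', 'Price': 100, 'description': None}
```

Passing `records` to `pd.DataFrame` should then give integer ID and Price columns, with description holding `None` rather than the string "null".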
CodePudding user response:
Try this:
from bs4 import BeautifulSoup
import pandas as pd
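import json, re
soup = BeautifulSoup(html_doc, "html.parser")
script = soup.find("script")
# Extract the list that follows the "lots" key in the script tag
json_data = script.text.split('"lots":')[1]
# Quote the bare values (123, ABC, null) so the text becomes valid JSON
json_data = re.sub(r':\s*([^",}]+?)\s*(?=[,}])', r':"\1"', json_data)
# Load the JSON data into a list of dicts
data = json.loads(json_data)
# Convert the list of dicts to a Pandas dataframe
df = pd.DataFrame(data)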
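Because `json.loads` raises on malformed input, it can help to check the extracted text before building the dataframe. A small sketch, assuming the raw text looks like the question's sample; `raw` and `try_parse` are hypothetical names used only for illustration:

```python
import json

# One record copied from the question: ABC is a bare word, so this is not valid JSON
raw = '[{"ID":123, "Name":ABC, "Price":100, "description": null }]'

def try_parse(text):
    """Hypothetical helper: return (parsed, None) on success or (None, message) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        # exc.pos is the character offset where the parser gave up
        snippet = text[exc.pos:exc.pos + 10]
        return None, f"{exc.msg} at position {exc.pos}: {snippet!r}"

# The raw record fails, and the message points at the unquoted value
parsed, error = try_parse(raw)
print(error)

# The same record with the name quoted parses cleanly
parsed, error = try_parse('[{"ID":123, "Name":"ABC", "Price":100, "description": null }]')
print(parsed)
```

The reported position shows which value needs quoting, which is a quick way to confirm that the text is almost-JSON rather than some other kind of object.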