DataFrame constructor not properly called when using data in HTML file

I would like to load some data from an HTML file into a pandas DataFrame, but I'm getting the error 'DataFrame constructor not properly called'. My data has the following structure. It is the data between the square brackets after "lots" that I would like to put into a DataFrame, but I'm confused about what type of object this is.

html_doc = """<html><head><script>


"unrequired_data = [{"ID":XXX, "Name":XXX, "Price":100GBP, "description": null },
        {"ID":XXX, "Name":XXX, "Price":150GBP, "description": null },
        {"ID":XXX, "Name":XXX, "Price":150GBP, "description": null }]

"lots":[{"ID":123, "Name":ABC, "Price":100, "description": null },
    {"ID":456, "Name":DEF, "Price":150, "description": null },
    {"ID":789, "Name":GHI, "Price":150, "description": null }]

</script></head></html>"""

I have tried the following code:

from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(html_doc)
df = pd.DataFrame("lots")

The output I would like to get would be in this format.

CodePudding user response：

Your data is not valid JSON (values such as ABC are not quoted), so you need to fix it before it can be parsed.
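To see the problem in isolation, here is a minimal sketch using only the standard library. The `sample` and `fixed` strings are one row copied from the question, not the full document; the point is that `json.loads` rejects the bare value ABC but accepts the same row once the name is quoted.

```python
import json

# One row copied from the question: ABC is not quoted, so this is not JSON.
sample = '[{"ID":123, "Name":ABC, "Price":100, "description": null }]'

try:
    json.loads(sample)
    parsed_ok = True
except json.JSONDecodeError as exc:
    parsed_ok = False
    print("invalid JSON:", exc)

# The same row with the name quoted is valid; JSON null becomes Python None.
fixed = '[{"ID":123, "Name":"ABC", "Price":100, "description": null }]'
rows = json.loads(fixed)
print(rows)
```
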

I would use:

from bs4 import BeautifulSoup
import pandas as pd
import json, re

# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html_doc, "html.parser")

# extract script
script = soup.find("script").text.strip()

# get the first block that starts with "lots" and keep what follows the first colon
data = next((s.split(':', maxsplit=1)[-1] for s in re.split('\n{2,}', script) if s.startswith('"lots"')), None)

# fix the json
if data:
    data = re.sub(r':\s*([^",}]+)\s*', r':"\1"', data)

df = pd.DataFrame(json.loads(data))

print(df)

Output:

    ID Name Price description
0  123  ABC   100       null 
1  456  DEF   150       null 
2  789  GHI   150       null
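
Note that the regex quotes every value, so all columns come back as strings and the missing descriptions are the text "null " (with a trailing space) rather than real missing values. If numeric columns are wanted, one possible follow-up is sketched below. The `records` list is a hand-typed stand-in for the result of `json.loads(data)` above, so the snippet can run on its own.

```python
import pandas as pd

# Stand-in for the list returned by json.loads(data) after the regex fix:
# every value is a string, and null became the text "null " (trailing space).
records = [
    {"ID": "123", "Name": "ABC", "Price": "100", "description": "null "},
    {"ID": "456", "Name": "DEF", "Price": "150", "description": "null "},
    {"ID": "789", "Name": "GHI", "Price": "150", "description": "null "},
]
df = pd.DataFrame(records)

# Convert the numeric columns from strings to numbers.
df[["ID", "Price"]] = df[["ID", "Price"]].apply(pd.to_numeric)

# Replace the "null" text with a real missing value.
is_null = df["description"].str.strip() == "null"
df["description"] = df["description"].mask(is_null)

print(df.dtypes)
```
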

CodePudding user response：

Try this:

from bs4 import BeautifulSoup
import pandas as pd
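import json, re

soup = BeautifulSoup(html_doc, "html.parser")
script = soup.find("script")

# Extract the text after the "lots" key (the sample has "lots": rather than lots=)
json_data = script.text.split('"lots":')[1].strip()

# Quote the bare values (123, ABC, null) so the text is valid JSON
json_data = re.sub(r':\s*([^",}\s]+)\s*', r':"\1"', json_data)

# Load the JSON data into a list of dicts
data = json.loads(json_data)

# Convert the list of dicts to a Pandas dataframe
df = pd.DataFrame(data)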
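
For reference, the error in the question comes from `pd.DataFrame("lots")`: that call passes the literal string "lots" to the constructor, not the data stored under that key in the HTML. The constructor expects an already parsed structure such as a list of dicts. A small illustration follows; the two-row `rows` list is a hand-typed excerpt of the question's data, used here only as an assumption for the demo.

```python
import pandas as pd

# Passing a plain string reproduces the error from the question.
try:
    pd.DataFrame("lots")
    message = None
except ValueError as exc:
    message = str(exc)
print(message)

# A list of dicts (what json.loads returns for a JSON array of objects) works.
rows = [
    {"ID": 123, "Name": "ABC", "Price": 100, "description": None},
    {"ID": 456, "Name": "DEF", "Price": 150, "description": None},
]
df = pd.DataFrame(rows)
print(df.shape)
```
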