I have an HTML document that, after extraction, looks something like this:
<div>
</div>
</div>
</div>
<script id="__NEXT_DATA__" type="application/json">
{"props":{"pageProps": {"type":"Job","sid":"a84cacbbcb07ec55cdbfd5fbe3d9f252d7f9cdd0","loggedIn":false,"userId":null,"avatar":null,"rating":{"count":"743","value":6.6},"metadata":{"title":"Medior/Senior Tester0"}
I want to extract certain key-value pairs from this script tag into a DataFrame. For example, I would like a column named "title" with the value "Medior/Senior Tester0" and a column named "customer" filled with null.
Calling soup.find('a-toaster Toaster_toaster__bTabZ') returns None, so using the result raises a NoneType error. What would be a good way to extract, for example, the title from this HTML ("Medior/Senior Tester0")?
Answer:
Try:
import json
from bs4 import BeautifulSoup
html_doc = """\
<div >
</div>
</div>
</div>
<script id="__NEXT_DATA__" type="application/json">
{"props":{"pageProps": {"type":"Job","sid":"a84cacbbcb07ec55cdbfd5fbe3d9f252d7f9cdd0","loggedIn":false,"userId":null,"avatar":null,"rating":{"count":"743","value":6.6},"metadata":{"title":"Medior/Senior Tester0"} } } }
</script>
</div>"""
soup = BeautifulSoup(html_doc, "html.parser")
# locate the script tag:
script = soup.select_one("#__NEXT_DATA__")
# decode the json:
data = json.loads(script.text)
# print all data:
print(data)
Prints:
{
"props": {
"pageProps": {
"type": "Job",
"sid": "a84cacbbcb07ec55cdbfd5fbe3d9f252d7f9cdd0",
"loggedIn": False,
"userId": None,
"avatar": None,
"rating": {"count": "743", "value": 6.6},
"metadata": {"title": "Medior/Senior Tester0"},
}
}
}
To print the title:
print(data["props"]["pageProps"]["metadata"]["title"])
Prints:
Medior/Senior Tester0
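To get from the decoded JSON to the DataFrame the question asks for, something along these lines might work. This is only a sketch: it assumes pandas is installed, the helper `dig` is a hypothetical name introduced here, and the location of "customer" (directly under "pageProps") is a guess, since that key does not appear in the sample. The JSON string is the same payload as above, inlined so the snippet runs on its own; in practice it would come from script.text.

```python
import json

import pandas as pd

# Same JSON payload as in the answer above (normally taken from script.text).
raw = (
    '{"props":{"pageProps":{"type":"Job",'
    '"sid":"a84cacbbcb07ec55cdbfd5fbe3d9f252d7f9cdd0",'
    '"loggedIn":false,"userId":null,"avatar":null,'
    '"rating":{"count":"743","value":6.6},'
    '"metadata":{"title":"Medior/Senior Tester0"}}}}'
)
data = json.loads(raw)


def dig(obj, *keys):
    """Follow keys through nested dicts; return None if any key is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


page_props = dig(data, "props", "pageProps")

row = {
    "title": dig(page_props, "metadata", "title"),
    # Assumption: "customer" would live under pageProps. It is absent
    # in the sample, so this lookup yields None (shown as a null cell).
    "customer": dig(page_props, "customer"),
}

# One dict per page; collect several dicts in the list for several pages.
df = pd.DataFrame([row], columns=["title", "customer"])
print(df)
```

Using a lookup that returns None for missing keys, rather than chained square brackets, avoids a KeyError on pages where a field is not present and leaves an empty cell in that column instead.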