Extract a class from Beautiful Soup

Time:11-11

I have an HTML document that, after extraction, looks something like this:

<div>
    </div>
   </div>
  </div>
  <script id="__NEXT_DATA__" type="application/json">
   {"props":{"pageProps": {"type":"Job","sid":"a84cacbbcb07ec55cdbfd5fbe3d9f252d7f9cdd0","loggedIn":false,"userId":null,"avatar":null,"rating":{"count":"743","value":6.6},"metadata":{"title":"Medior/Senior Tester0"}

I am interested in extracting certain key-value pairs from this script into a DataFrame. For example, I would like a column named "title" with the value "Medior/Senior Tester0" and a column "customer" filled with null.

soup.find('a-toaster Toaster_toaster__bTabZ') returns None, so I get a NoneType error. What would be a good way to extract, for example, the title from this HTML (Medior/Senior Tester)?
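
A note on the None result: the first positional argument of find() is a tag name, not a class, so the call above looks for a tag literally named "a-toaster Toaster_toaster__bTabZ". The sketch below illustrates the difference. The markup in it is hypothetical and assumes the two names are CSS classes on a div, which the snippet in the question does not confirm.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: assumes the two names are CSS classes on a <div>.
html_doc = '<div class="a-toaster Toaster_toaster__bTabZ">hello</div>'
soup = BeautifulSoup(html_doc, "html.parser")

# find() treats its first positional argument as a tag name, so this
# searches for a tag with that literal name and finds nothing.
by_name = soup.find("a-toaster Toaster_toaster__bTabZ")
print(by_name)  # None

# Searching by class instead: a CSS selector that requires both classes.
by_class = soup.select_one(".a-toaster.Toaster_toaster__bTabZ")

# Guard against None before touching attributes of the result.
if by_class is not None:
    print(by_class.get_text())
```

Even with a class search, the title would not be found this way if it only lives inside the JSON of the script tag, so checking the result for None before using it is a sensible habit.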

CodePudding user response:

Try:

import json
from bs4 import BeautifulSoup

html_doc = """\
<div>
    </div>
   </div>
  </div>
  <script id="__NEXT_DATA__" type="application/json">
   {"props":{"pageProps": {"type":"Job","sid":"a84cacbbcb07ec55cdbfd5fbe3d9f252d7f9cdd0","loggedIn":false,"userId":null,"avatar":null,"rating":{"count":"743","value":6.6},"metadata":{"title":"Medior/Senior Tester0"} } } }
  </script>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")

# locate the script tag:
script = soup.select_one("#__NEXT_DATA__")

# decode the JSON (.string returns the tag's single text child):
data = json.loads(script.string)

# print all data:
print(data)

Prints (reformatted here for readability):

{
    "props": {
        "pageProps": {
            "type": "Job",
            "sid": "a84cacbbcb07ec55cdbfd5fbe3d9f252d7f9cdd0",
            "loggedIn": False,
            "userId": None,
            "avatar": None,
            "rating": {"count": "743", "value": 6.6},
            "metadata": {"title": "Medior/Senior Tester0"},
        }
    }
}

To print the title:

print(data["props"]["pageProps"]["metadata"]["title"])

Prints:

Medior/Senior Tester0
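
For the DataFrame the question asks about, one option is to pick the wanted keys out of the decoded JSON and hand them to pandas. This is a minimal sketch, assuming pandas is installed. The `raw` and `row` names are made up for illustration, and the idea that a "customer" key would sit directly under "pageProps" is an assumption: the sample data has no such key, so that column comes out empty here.

```python
import json

import pandas as pd

# The JSON text from the __NEXT_DATA__ script tag in the answer above.
raw = """{"props":{"pageProps": {"type":"Job","sid":"a84cacbbcb07ec55cdbfd5fbe3d9f252d7f9cdd0","loggedIn":false,"userId":null,"avatar":null,"rating":{"count":"743","value":6.6},"metadata":{"title":"Medior/Senior Tester0"} } } }"""
data = json.loads(raw)
page_props = data["props"]["pageProps"]

# One dict per scraped page; .get() returns None when a key is absent.
row = {
    "title": page_props.get("metadata", {}).get("title"),
    # Assumption: "customer" would sit directly under pageProps. It is not
    # in the sample, so this value is None.
    "customer": page_props.get("customer"),
}

# A list of row dicts becomes a DataFrame with one column per key.
df = pd.DataFrame([row])
print(df)
```

When scraping several pages, collecting one such dict per page in a list and passing the whole list to pd.DataFrame would give one row per page.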