How to find data in HTML using bs4 when it has : and not =

Time:09-28

Hi, I'm having trouble using bs4 and regex to find information in the HTML below, because the values are stored as key:value pairs (with :) inside an attribute rather than as attribute=value pairs (with =) like I am used to.

<div data-react-cache-id="ListItemSale-0" data-react-class="ListItemSale" data-react-props='{"imageUrl":"https://laced.imgix.net/products/aa0ff81c-ec3b-4275-82b3-549c819d1404.jpg?w=196","title":{"label":"Air Jordan 1 Mid Madder Root GS","href":"/products/air-jordan-1-mid-madder-root-gs"},"contentCount":3,"info":"UK 4.5 | EU 37.5 | US 5","subInfo":"DM9077-108","hasStatus":true,"isBuyer":false,"status":"pending_shipment","statusOverride":null,"statusMessage":"Pending","statusMods":["red"],"price":"£125","priceAction":null,"subPrice":null,"actions":[{"label":"View","href":"/account/selling/M2RO1DNV"},{"label":"Re-Print Postage","href":"/account/selling/M2RO1DNV/shipping-label","options":{"disabled":false}},{"label":"View Postage","href":"/account/selling/M2RO1DNV/shipping-label.pdf","options":{"target":"_blank","disabled":false}}]}'></div>

I am trying to extract the href link in

{"label":"Re-Print Postage","href":"/account/selling/M2RO1DNV/shipping-label"

How do I do this? I've tried regex and find_all, but to no avail. Thanks

My code is below for reference. The lines commented out with # are some of the approaches I have tried, among many others.

    account_soup = bs(my_account.text, 'lxml')

    links = account_soup.find_all('div', {'data-react-class': 'ListItemSale'})
    

#for links in download_link['actions']:
    #print(links['href'])


#for i in links:
    #link_main = i.find('title')
    #link = re.findall('^/account*shipping-label$', link_main)
    #print(link)

CodePudding user response:

You need to fetch the data-react-props attribute of each div and parse it as JSON. You can then iterate over the actions property and collect each href that matches your pattern:

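import json
import re

actions = []
for l in links:
    # the attribute value is a JSON string, so decode it into a dict
    props = json.loads(l['data-react-props'])
    for a in props['actions']:
        # keep only hrefs that start with /account and end with shipping-label
        m = re.match(r'^/account.*shipping-label$', a['href'])
        if m is not None:
            actions.append(m[0])

print(actions)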

Output for your sample data:

['/account/selling/M2RO1DNV/shipping-label']
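
If the regex feels brittle, one possible alternative is to pick the action by its label instead of by the shape of its href. The sketch below is only an illustration: hrefs_for_label is a hypothetical helper name, SAMPLE_PROPS is a trimmed copy of the data-react-props value from the question (only the actions key is kept), and it assumes the label text "Re-Print Postage" stays stable on the site.

```python
import json

# Trimmed copy of the data-react-props value from the question
# (assumption: only the "actions" key matters here, so the rest is omitted).
SAMPLE_PROPS = (
    '{"actions":['
    '{"label":"View","href":"/account/selling/M2RO1DNV"},'
    '{"label":"Re-Print Postage","href":"/account/selling/M2RO1DNV/shipping-label",'
    '"options":{"disabled":false}},'
    '{"label":"View Postage","href":"/account/selling/M2RO1DNV/shipping-label.pdf",'
    '"options":{"target":"_blank","disabled":false}}'
    ']}'
)


def hrefs_for_label(props_json, label):
    """Hypothetical helper: return the href of every action with the given label."""
    props = json.loads(props_json)
    # .get() guards against items that have no "actions" key at all
    return [
        a["href"]
        for a in props.get("actions", [])
        if a.get("label") == label and "href" in a
    ]


print(hrefs_for_label(SAMPLE_PROPS, "Re-Print Postage"))
# prints ['/account/selling/M2RO1DNV/shipping-label']
```

With bs4 you would pass l['data-react-props'] for each div in links instead of SAMPLE_PROPS.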