Hi, I am having trouble using bs4 and regex to extract information from the HTML below. The values are stored as key:value pairs (with :) inside an attribute, rather than as attributes with =, which is what I am used to.
<div data-react-cache-id="ListItemSale-0" data-react-class="ListItemSale" data-react-props='{"imageUrl":"https://laced.imgix.net/products/aa0ff81c-ec3b-4275-82b3-549c819d1404.jpg?w=196","title":{"label":"Air Jordan 1 Mid Madder Root GS","href":"/products/air-jordan-1-mid-madder-root-gs"},"contentCount":3,"info":"UK 4.5 | EU 37.5 | US 5","subInfo":"DM9077-108","hasStatus":true,"isBuyer":false,"status":"pending_shipment","statusOverride":null,"statusMessage":"Pending","statusMods":["red"],"price":"£125","priceAction":null,"subPrice":null,"actions":[{"label":"View","href":"/account/selling/M2RO1DNV"},{"label":"Re-Print Postage","href":"/account/selling/M2RO1DNV/shipping-label","options":{"disabled":false}},{"label":"View Postage","href":"/account/selling/M2RO1DNV/shipping-label.pdf","options":{"target":"_blank","disabled":false}}]}'></div>
I am trying to extract the href link in
{"label":"Re-Print Postage","href":"/account/selling/M2RO1DNV/shipping-label"
How do I do this? I've tried regex and find_all, but to no avail. Thanks
My code is below for reference. The lines commented out with # are some of the attempts I have tried, among many others.
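from bs4 import BeautifulSoup as bs  # assumed import alias for bs
import re

# my_account: response for the account page, fetched earlier (not shown)
account_soup = bs(my_account.text, 'lxml')
links = account_soup.find_all('div', {'data-react-class': 'ListItemSale'})
# Attempt 1:
#for links in download_link['actions']:
#    print(links['href'])
# Attempt 2:
#for i in links:
#    link_main = i.find('title')
#    link = re.findall('^/account*shipping-label$', link_main)
#    print(link)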
CodePudding user response:
You need to fetch the data-react-props attribute of each div, then parse it as JSON. You can then iterate over the actions property and collect each href property that matches your pattern:
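import json
import re

actions = []
for l in links:
    # data-react-props holds a JSON string, so parse it into a dict
    props = json.loads(l['data-react-props'])
    for a in props['actions']:
        # keep only hrefs that start with /account and end with shipping-label
        m = re.match(r'^/account.*shipping-label$', a['href'])
        if m is not None:
            actions.append(m[0])
print(actions)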
Output for your sample data:
['/account/selling/M2RO1DNV/shipping-label']
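Since each entry in actions also carries a label, a possible alternative to the regex is to look the link up by its label. The sketch below is only an illustration: find_action_href and SAMPLE_PROPS are hypothetical names, and SAMPLE_PROPS is a shortened copy of the data-react-props value from the question. In real use you would pass in the attribute string read from each div (for example l['data-react-props']).

```python
import json

# Shortened copy of the data-react-props value from the question (illustrative only).
SAMPLE_PROPS = (
    '{"title":{"label":"Air Jordan 1 Mid Madder Root GS"},'
    '"actions":['
    '{"label":"View","href":"/account/selling/M2RO1DNV"},'
    '{"label":"Re-Print Postage","href":"/account/selling/M2RO1DNV/shipping-label",'
    '"options":{"disabled":false}},'
    '{"label":"View Postage","href":"/account/selling/M2RO1DNV/shipping-label.pdf",'
    '"options":{"target":"_blank","disabled":false}}]}'
)


def find_action_href(props_json, label):
    """Return the href of the first action with the given label, or None.

    Hypothetical helper: props_json is the raw string stored in the
    data-react-props attribute.
    """
    props = json.loads(props_json)
    # .get() avoids a KeyError if a div has no actions list
    for action in props.get("actions", []):
        if action.get("label") == label:
            return action.get("href")
    return None


print(find_action_href(SAMPLE_PROPS, "Re-Print Postage"))
# prints /account/selling/M2RO1DNV/shipping-label
```

Matching on the label returns a single link per item and does not depend on the URL shape, but it assumes the label text stays exactly "Re-Print Postage".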