How to use regex to get data between two HTML tags

Time:12-29

I have a list of 7,000 lines of HTML tags like this:

<div  data-enhanced-ecommerce='{"id":3408832,"name":"کیف پول چرم جانتا مدل 124","category":"UNI WALLETS BAGS","brand":"چرم جانتا","variant":11053860,"price":3230000,"quantity":0}' data-id="3408832" data-index="1" data-observed="0" data-price="۳۲۳,۰۰۰" data-seen="false" data-title-en="Janta Leather 124 Wallet" data-title-fa="کیف پول چرم جانتا مدل 124"><div  data-csrf-token="" data-id="3408832"></div><a c-product__seller-details c-product__seller-details--item-grid"><span ><span >فروشنده: </span>

I want to extract everything between these start and end markers:

start = '<div  data-enhanced-ecommerce'
end = '></div>'

CodePudding user response:

IIUC, you want to extract the data attributes from the div tags.

Don't use a regex for that; use an XML/HTML parser instead.

BeautifulSoup is well suited here because it can handle malformed input (which is the case here).

I haven't used BS for a while, so the following example might not be the most efficient or up-to-date way to do it, but it should give you a starting point:

from bs4 import BeautifulSoup as BS

s = '''<div  data-enhanced-ecommerce='{"id":3408832,"name":"کیف پول چرم جانتا مدل 124","category":"UNI WALLETS BAGS","brand":"چرم جانتا","variant":11053860,"price":3230000,"quantity":0}' data-id="3408832" data-index="1" data-observed="0" data-price="۳۲۳,۰۰۰" data-seen="false" data-title-en="Janta Leather 124 Wallet" data-title-fa="کیف پول چرم جانتا مدل 124"><div  data-csrf-token="" data-id="3408832"></div><a c-product__seller-details c-product__seller-details--item-grid"><span ><span >فروشنده: </span>'''

# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning
soup = BS(s, 'html.parser')

data = [{k:v for k,v in e.attrs.items() if k.startswith('data')}
        for e in soup.find_all('div')]

Now you have a list with one item per div tag. Each item is a dictionary that maps every attribute whose name starts with "data" to its value.
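Since the question mentions a list of about 7,000 such lines, the same comprehension can be applied to each list item in turn. The following is only a sketch: the helper name `extract_data_attrs` and the shortened `fragments` list are hypothetical stand-ins, not part of the original data.

```python
from bs4 import BeautifulSoup


def extract_data_attrs(fragments):
    """Return one dict of data-* attributes per <div> found in the fragments."""
    rows = []
    for fragment in fragments:
        # Parse each list item on its own; html.parser tolerates unclosed tags
        soup = BeautifulSoup(fragment, 'html.parser')
        for div in soup.find_all('div'):
            rows.append({k: v for k, v in div.attrs.items()
                         if k.startswith('data-')})
    return rows


# Hypothetical, shortened stand-ins for the real 7,000 lines
fragments = [
    '<div data-id="1" data-index="1" class="item"><div data-csrf-token="" data-id="1"></div>',
    '<div data-id="2" data-index="2" class="item"><div data-csrf-token="" data-id="2"></div>',
]

rows = extract_data_attrs(fragments)
print(len(rows))
```

Each fragment contributes one dictionary per div (outer and nested), so the result is a flat list that can be filtered afterwards, for example by keeping only the dictionaries that have a 'data-enhanced-ecommerce' key.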

output:

[{'data-enhanced-ecommerce': '{"id":3408832,"name":"کیف پول چرم جانتا مدل 124","category":"UNI WALLETS BAGS","brand":"چرم جانتا","variant":11053860,"price":3230000,"quantity":0}',
  'data-id': '3408832',
  'data-index': '1',
  'data-observed': '0',
  'data-price': '۳۲۳,۰۰۰',
  'data-seen': 'false',
  'data-title-en': 'Janta Leather 124 Wallet',
  'data-title-fa': 'کیف پول چرم جانتا مدل 124'},
 {'data-csrf-token': '', 'data-id': '3408832'}]
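
Note that the 'data-enhanced-ecommerce' value is still a JSON string. If the individual fields are needed, it can be decoded with the standard json module. A minimal sketch, assuming a list shaped like the output above (the values here are shortened for illustration, and the name `products` is hypothetical):

```python
import json

# Entries shaped like the output above, shortened for illustration
data = [
    {'data-enhanced-ecommerce': '{"id":3408832,"category":"UNI WALLETS BAGS",'
                                '"variant":11053860,"price":3230000,"quantity":0}',
     'data-id': '3408832',
     'data-index': '1'},
    {'data-csrf-token': '', 'data-id': '3408832'},
]

# Decode the JSON payload only for divs that actually carry it
products = [json.loads(d['data-enhanced-ecommerce'])
            for d in data if 'data-enhanced-ecommerce' in d]

print(products[0]['price'])  # 3230000
```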