I have 7,000 lines of HTML tags in a list, like this:
<div data-enhanced-ecommerce='{"id":3408832,"name":"کیف پول چرم جانتا مدل 124","category":"UNI WALLETS BAGS","brand":"چرم جانتا","variant":11053860,"price":3230000,"quantity":0}' data-id="3408832" data-index="1" data-observed="0" data-price="۳۲۳,۰۰۰" data-seen="false" data-title-en="Janta Leather 124 Wallet" data-title-fa="کیف پول چرم جانتا مدل 124"><div data-csrf-token="" data-id="3408832"></div><a c-product__seller-details c-product__seller-details--item-grid"><span ><span >فروشنده: </span>
I want to extract the text between a start marker and an end marker:
start = '<div data-enhanced-ecommerce'
end = '></div>'
CodePudding user response:
IIUC, you want to extract the data attributes from the div tags.
Don't use a regex for that; use an XML/HTML parser instead. BeautifulSoup is well suited, as it can handle ill-formatted input (which is the case here).
I've not used BS for a while, so the following example might not be the most efficient or up-to-date way to achieve it, but it will give you a starting point:
from bs4 import BeautifulSoup as BS
s = '''<div data-enhanced-ecommerce='{"id":3408832,"name":"کیف پول چرم جانتا مدل 124","category":"UNI WALLETS BAGS","brand":"چرم جانتا","variant":11053860,"price":3230000,"quantity":0}' data-id="3408832" data-index="1" data-observed="0" data-price="۳۲۳,۰۰۰" data-seen="false" data-title-en="Janta Leather 124 Wallet" data-title-fa="کیف پول چرم جانتا مدل 124"><div data-csrf-token="" data-id="3408832"></div><a c-product__seller-details c-product__seller-details--item-grid"><span ><span >فروشنده: </span>'''
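# Name the parser explicitly; leaving it out makes bs4 guess one and emit a warning
soup = BS(s, 'html.parser')
# One dict per div, keeping only the data-* attributes
data = [{k: v for k, v in e.attrs.items() if k.startswith('data-')}
        for e in soup.find_all('div')]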
Now you have a list with one item per div tag; each item is a dictionary that maps every attribute starting with data to its value.
Output:
[{'data-enhanced-ecommerce': '{"id":3408832,"name":"کیف پول چرم جانتا مدل 124","category":"UNI WALLETS BAGS","brand":"چرم جانتا","variant":11053860,"price":3230000,"quantity":0}',
'data-id': '3408832',
'data-index': '1',
'data-observed': '0',
'data-price': '۳۲۳,۰۰۰',
'data-seen': 'false',
'data-title-en': 'Janta Leather 124 Wallet',
'data-title-fa': 'کیف پول چرم جانتا مدل 124'},
{'data-csrf-token': '', 'data-id': '3408832'}]
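
The data-enhanced-ecommerce value is itself a JSON string, so if you need the individual fields (id, price, ...) you could decode it with the standard json module. A minimal sketch, assuming data has the shape shown in the output above; decode_products is only an illustrative helper name, and the sample is shortened to a few keys:

```python
import json

# Sample shaped like the output above (shortened; not the full attribute set)
data = [
    {'data-enhanced-ecommerce': '{"id":3408832,"category":"UNI WALLETS BAGS",'
                                '"variant":11053860,"price":3230000,"quantity":0}',
     'data-id': '3408832',
     'data-title-en': 'Janta Leather 124 Wallet'},
    {'data-csrf-token': '', 'data-id': '3408832'},
]

def decode_products(items, key='data-enhanced-ecommerce'):
    """Hypothetical helper: decode the JSON payload of each item that has `key`."""
    # Items without the key (such as the inner csrf div) are skipped
    return [json.loads(item[key]) for item in items if key in item]

products = decode_products(data)
print(products[0]['id'], products[0]['price'])
```

Each entry in products is a regular dict, so numeric fields such as price come back as int rather than str.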