I'm pretty new to Python and XML parsing. I need to parse an XML file from the Internet, and I'm running into a problem: I don't know how to get the information I want, because it sits inside a generic tag that is used hundreds of times in the document.
The XML file follows this structure:
<text>
<dl>
<dt>1. Information: </dt>
<dd>
<dl>
<dt>1.1) Name: </dt>
<dd>Company name.</dd>
<dt>1.2) ID: </dt>
<dd>Number.</dd>
<dt>1.3) Address: </dt>
<dd>San Bernardo, 45.</dd>
<dt>1.4) City: </dt>
<dd>Madrid.</dd>
<dt>1.5) Province: </dt>
<dd>Madrid.</dd>
<dt>1.6) Postal Code: </dt>
<dd>28015.</dd>
<dt>1.7) Country: </dt>
<dd>Spain.</dd>
<dt>1.8) Code: </dt>
<dd>587463.</dd>
<dt>1.9) Phone: </dt>
<dd> 34 PhoneNumber.</dd>
<dt>1.11) email: </dt>
<dd>[email protected]</dd>
<dt>1.13) Buyer Address: </dt>
<dd>https://example_url.com</dd>
</dl>
</dd>
<dt>2. Organization:</dt>
<dd>
<dl>
<dt>2.1) Type: </dt>
<dd>Administration</dd>
<dt>2.2) Activity: </dt>
<dd>Administration.</dd>
</dl>
</dd>
<dt>4. Codes:</dt>
<dd>
<dl>
<dt>4.1) Main Code: </dt>
<dd>72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).</dd>
<dt>4.2) Code 2</dt>
<dd>72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).</dd>
<dt>4.3) Code 3</dt>
<dd>72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).</dd>
<dt>4.4) Code 4</dt>
<dd>72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).</dd>
</dl>
</dd>
</text>
This is just a small fragment of the document. From it I want to get all the codes inside the "Codes" section; in this case the output should be something similar to: "72000000,72260000,72600000".
In this example all codes are the same in every "dd" tag, but since this should work on any file we retrieve from the web, I should get all codes and then eliminate the duplicated ones.
To parse other tags in the document I'm currently using BeautifulSoup this way:
Parsing document:
soup = get_soup_from_url(url, "lxml")

def find_text_by_tag(self, soup, tag):
    item = soup.find(tag)
    return item.text if item else ''
The function get_soup_from_url() is imported from another file; here's the code:
from selenium import webdriver  # Selenium is a web testing library. It is used to automate browser activities.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup  # Beautiful Soup is a Python package for parsing HTML and XML documents. It creates parse trees that make it easy to extract data.

def get_driver_from_url(url):
    options = webdriver.ChromeOptions()
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--headless")
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.get(url)
    return driver

def get_soup_from_url(url, parser):
    driver = get_driver_from_url(url)
    content = driver.page_source
    soup = BeautifulSoup(content, parser)
    return soup

def get_soup_from_driver(driver, parser):
    content = driver.page_source
    soup = BeautifulSoup(content, parser)
    return soup

def get_page_source_from_url(url):
    driver = get_driver_from_url(url)
    return driver.page_source
Thanks in advance to anyone who can give me any information on how to do this.
CodePudding user response:
Based on your example HTML/XML, select your elements with a CSS selector:
data = list(soup.select_one('dt:-soup-contains(" Codes:") + dd dl').stripped_strings)
Then iterate over the list to create a dict, and use the pattern mentioned by @Sergey K to extract only the digits:
{k.split(') ')[-1]:re.findall(r'\d{8}', v) for k, v in zip(data[::2], data[1::2])}
That results in:
{'Main Code:': ['72000000', '72260000', '72600000'],
'Code 2': ['72000000', '72260000', '72600000'],
'Code 3': ['72000000', '72260000', '72600000'],
'Code 4': ['72000000', '72260000', '72600000']}
EDIT
To get all codes as a single comma-separated string:
','.join(
    [
        c
        for e in soup.select('dt:-soup-contains(" Codes:") + dd dd')
        for c in re.findall(r'\d{8}', e.text)
    ]
)
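Note that the join above keeps every match, so a code that appears in several dd elements is repeated in the result. Since the question asks for duplicates to be removed, here is a minimal sketch of one way to do that while keeping first-seen order. The helper name unique_codes and the sample dd_texts list are hypothetical, and the \b word boundaries are an added assumption so that longer digit runs are not sliced into 8-digit pieces; in practice the texts would come from the selected dd elements (e.text).

```python
import re

def unique_codes(texts):
    """Collect 8-digit codes from an iterable of strings, dropping repeats.

    Hypothetical helper: order of first appearance is preserved because
    dict keys keep insertion order (Python 3.7+).
    """
    codes = (c for t in texts for c in re.findall(r'\b\d{8}\b', t))
    return list(dict.fromkeys(codes))

# Sample input standing in for the text of each selected <dd> element.
dd_texts = [
    "72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).",
    "72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).",
]

print(','.join(unique_codes(dd_texts)))  # 72000000,72260000,72600000
```

A set would also remove duplicates, but it does not guarantee order, which is why dict.fromkeys is used here.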
Example
from bs4 import BeautifulSoup
import re
html = '''
<text>
<dl>
<dt>1. Information: </dt>
<dd>
<dl>
<dt>1.1) Name: </dt>
<dd>Company name.</dd>
<dt>1.2) ID: </dt>
<dd>Number.</dd>
<dt>1.3) Address: </dt>
<dd>San Bernardo, 45.</dd>
<dt>1.4) City: </dt>
<dd>Madrid.</dd>
<dt>1.5) Province: </dt>
<dd>Madrid.</dd>
<dt>1.6) Postal Code: </dt>
<dd>28015.</dd>
<dt>1.7) Country: </dt>
<dd>Spain.</dd>
<dt>1.8) Code: </dt>
<dd>587463.</dd>
<dt>1.9) Phone: </dt>
<dd> 34 PhoneNumber.</dd>
<dt>1.11) email: </dt>
<dd>[email protected]</dd>
<dt>1.13) Buyer Address: </dt>
<dd>https://example_url.com</dd>
</dl>
</dd>
<dt>2. Organization:</dt>
<dd>
<dl>
<dt>2.1) Type: </dt>
<dd>Administration</dd>
<dt>2.2) Activity: </dt>
<dd>Administration.</dd>
</dl>
</dd>
<dt>4. Codes:</dt>
<dd>
<dl>
<dt>4.1) Main Code: </dt>
<dd>72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).</dd>
<dt>4.2) Code 2</dt>
<dd>72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).</dd>
<dt>4.3) Code 3</dt>
<dd>72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).</dd>
<dt>4.4) Code 4</dt>
<dd>72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).</dd>
</dl>
</dd>
</text>
'''
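# Name the parser explicitly to avoid bs4's "no parser was explicitly specified" warning.
soup = BeautifulSoup(html, 'html.parser')
# "+ dd" picks the <dd> that directly follows the matching <dt> (they are siblings).
data = list(soup.select_one('dt:-soup-contains(" Codes:") + dd dl').stripped_strings)
# Labels sit at even positions and values at odd positions; keep only the 8-digit codes.
print({k.split(') ')[-1]: re.findall(r'\d{8}', v) for k, v in zip(data[::2], data[1::2])})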