How to get an element inside a duplicate tag?

I'm pretty new to Python and XML parsing. I need to parse an XML file from the Internet, and I don't know how to get the information I want because it sits inside a generic tag that is used hundreds of times in the document.

The XML file follows this structure:

<text>
  <dl>
    <dt>1. Information: </dt>
    <dd>
      <dl>
        <dt>1.1) Name: </dt>
        <dd>Company name.</dd>
        <dt>1.2) ID: </dt>
        <dd>Number.</dd>
        <dt>1.3) Address: </dt>
        <dd>San Bernardo, 45.</dd>
        <dt>1.4) City: </dt>
        <dd>Madrid.</dd>
        <dt>1.5) Province: </dt>
        <dd>Madrid.</dd>
        <dt>1.6) Postal Code: </dt>
        <dd>28015.</dd>
        <dt>1.7) Country: </dt>
        <dd>Spain.</dd>
        <dt>1.8) Code: </dt>
        <dd>587463.</dd>
        <dt>1.9) Phone: </dt>
        <dd>+34 PhoneNumber.</dd>
        <dt>1.11) email: </dt>
        <dd>[email protected]</dd>
        <dt>1.13) Buyer Address: </dt>
        <dd>https://example_url.com</dd>
      </dl>
    </dd>
    <dt>2. Organization:</dt>
    <dd>
      <dl>
        <dt>2.1) Type: </dt>
        <dd>Administration</dd>
        <dt>2.2) Activity: </dt>
        <dd>Administration.</dd>
      </dl>
    </dd>
    <dt>4. Codes:</dt>
    <dd>
      <dl>
        <dt>4.1) Main Code: </dt>
        <dd>72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).</dd>
        <dt>4.2) Code 2</dt>
        <dd>72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).</dd>
        <dt>4.3) Code 3</dt>
        <dd>72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).</dd>
        <dt>4.4) Code 4</dt>
        <dd>72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).</dd>
      </dl>
    </dd>
</text>

This is just a small fragment of the document. From it I want to get all the codes in the "Codes" section; in this case the output should be something similar to: "72000000,72260000,72600000".

In this example all codes are the same in every "dd" tag, but since this should work on any file we retrieve from the web, I should get all codes and then eliminate the duplicated ones.

To parse other tags in the document I'm currently using BeautifulSoup this way:

Parsing document:

soup = get_soup_from_url(url, "lxml")

def find_text_by_tag(self, soup, tag):
   item = soup.find(tag)
   return item.text if item else ''

The function get_soup_from_url() is imported from another file; here's the code:

from selenium import webdriver      # Selenium is a web testing library. It is used to automate browser activities.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup       # Beautiful Soup is a Python package for parsing HTML and XML documents. It builds a parse tree that makes it easy to extract data.

def get_driver_from_url(url):
    options = webdriver.ChromeOptions()
    options.add_argument("--no-sandbox") 
    options.add_argument("--disable-dev-shm-usage") 
    options.add_argument("--headless") 
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.get(url)
    return driver

def get_soup_from_url(url, parser):
    driver = get_driver_from_url(url)
    try:
        # Parse the rendered page, then always close the browser so Chrome processes do not pile up.
        return BeautifulSoup(driver.page_source, parser)
    finally:
        driver.quit()

def get_soup_from_driver(driver, parser):
    content = driver.page_source
    soup = BeautifulSoup(content, parser)
    return soup

def get_page_source_from_url(url):
    driver = get_driver_from_url(url)
    try:
        return driver.page_source
    finally:
        driver.quit()  # close the browser once the source has been read

Thanks in advance to anyone who can give me any information on how to do this.

CodePudding user response：

Based on your example HTML/XML, select the elements with a CSS selector. The + combinator picks the dd that directly follows the "Codes:" dt:

data = list(soup.select_one('dt:-soup-contains(" Codes:") + dd dl').stripped_strings)

Then pair up the strings in that list to build a dict, using the pattern suggested by @Sergey K to extract only the digits (this needs import re):

{k.split(') ')[-1]: re.findall(r'\d{8}', v) for k, v in zip(data[::2], data[1::2])}

That results in:

{'Main Code:': ['72000000', '72260000', '72600000'],
 'Code 2': ['72000000', '72260000', '72600000'],
 'Code 3': ['72000000', '72260000', '72600000'],
 'Code 4': ['72000000', '72260000', '72600000']}

EDIT

To get all codes as a single comma-separated string:

','.join(
    [
        c
        for e in soup.select('dt:-soup-contains(" Codes:") + dd dd')
        for c in re.findall(r'\d{8}', e.text)
    ]
)
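
The join above keeps every match, so a code that appears in several dd elements is repeated in the result. Since the question asks for each code only once, here is a minimal sketch of the de-duplication step. The helper name unique_codes and the sample dd_texts list are assumptions made for illustration; in practice the texts would come from something like e.text for each selected dd.

```python
import re

def unique_codes(texts):
    """Collect 8-digit codes from an iterable of strings,
    dropping repeats while keeping first-seen order."""
    found = []
    for text in texts:
        found.extend(re.findall(r"\d{8}", text))
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(found))

# Hypothetical sample: the text of two <dd> elements from the "Codes" section
dd_texts = [
    "72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).",
    "72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).",
]

result = ",".join(unique_codes(dd_texts))
print(result)  # 72000000,72260000,72600000
```

Using dict.fromkeys rather than set keeps the codes in document order, which a plain set would not guarantee.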

Example

from bs4 import BeautifulSoup
import re

html = '''
<text>
  <dl>
    <dt>1. Information: </dt>
    <dd>
      <dl>
        <dt>1.1) Name: </dt>
        <dd>Company name.</dd>
        <dt>1.2) ID: </dt>
        <dd>Number.</dd>
        <dt>1.3) Address: </dt>
        <dd>San Bernardo, 45.</dd>
        <dt>1.4) City: </dt>
        <dd>Madrid.</dd>
        <dt>1.5) Province: </dt>
        <dd>Madrid.</dd>
        <dt>1.6) Postal Code: </dt>
        <dd>28015.</dd>
        <dt>1.7) Country: </dt>
        <dd>Spain.</dd>
        <dt>1.8) Code: </dt>
        <dd>587463.</dd>
        <dt>1.9) Phone: </dt>
        <dd>+34 PhoneNumber.</dd>
        <dt>1.11) email: </dt>
        <dd>[email protected]</dd>
        <dt>1.13) Buyer Address: </dt>
        <dd>https://example_url.com</dd>
      </dl>
    </dd>
    <dt>2. Organization:</dt>
    <dd>
      <dl>
        <dt>2.1) Type: </dt>
        <dd>Administration</dd>
        <dt>2.2) Activity: </dt>
        <dd>Administration.</dd>
      </dl>
    </dd>
    <dt>4. Codes:</dt>
    <dd>
      <dl>
        <dt>4.1) Main Code: </dt>
        <dd>72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).</dd>
        <dt>4.2) Code 2</dt>
        <dd>72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).</dd>
        <dt>4.3) Code 3</dt>
        <dd>72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).</dd>
        <dt>4.4) Code 4</dt>
        <dd>72000000 (Services: software, internet), 72260000 (Software related services) and 72600000 (Informatic services).</dd>
      </dl>
    </dd>
</text>
'''
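soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
# the + combinator selects the dd that directly follows the "Codes:" dt
data = list(soup.select_one('dt:-soup-contains(" Codes:") + dd dl').stripped_strings)
print({k.split(') ')[-1]: re.findall(r'\d{8}', v) for k, v in zip(data[::2], data[1::2])})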
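
If the :-soup-contains pseudo-class is not available in your environment (it needs a reasonably recent soupsieve), the same section can probably be reached with plain find() and find_next_sibling(). The sketch below is one possible alternative, not the method used above; the function name codes_in_section and the shortened sample document are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

def codes_in_section(soup, heading="Codes:"):
    """Return the 8-digit codes under the <dt> whose text ends with `heading`."""
    # Locate the section title, e.g. "4. Codes:", by its text.
    dt = soup.find("dt", string=lambda s: s and s.strip().endswith(heading))
    if dt is None:
        return []
    # The section body is the <dd> sibling that follows that <dt>.
    dd = dt.find_next_sibling("dd")
    if dd is None:
        return []
    return re.findall(r"\d{8}", dd.get_text(" "))

# Shortened, hypothetical version of the document from the question
html = """
<text>
  <dl>
    <dt>1. Information: </dt>
    <dd><dl><dt>1.8) Code: </dt><dd>587463.</dd></dl></dd>
    <dt>4. Codes:</dt>
    <dd>
      <dl>
        <dt>4.1) Main Code: </dt>
        <dd>72000000 (Services: software, internet) and 72260000 (Software related services).</dd>
        <dt>4.2) Code 2</dt>
        <dd>72000000 (Services: software, internet) and 72600000 (Informatic services).</dd>
      </dl>
    </dd>
  </dl>
</text>
"""

soup = BeautifulSoup(html, "html.parser")
codes = codes_in_section(soup)
print(",".join(dict.fromkeys(codes)))  # 72000000,72260000,72600000
```

Matching on the end of the heading text ("Codes:") keeps entries such as "1.8) Code:" from being picked up by mistake.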