Use BeautifulSoup to parse a specific word from a tag in Python

Time:06-26

I use BeautifulSoup to parse an XML file. The parsing is done by tag name, but can I also filter on another word inside the tag?

Data = soup.find_all('Data')
for Data in Data:
    Data = Data.get_text()

Data is the name of the tag, but can I select on a word inside the tag when parsing, maybe like this:

Data = soup.find_all("Data", name='ObjectClass')
for Data in Data:
    Data = Data.get_text()
    print(Data)

I tried this but got this error: TypeError: Tag.find_all() got multiple values for argument 'name'

This is an XML example:

<Document>
  <Data Name="ObjectClass">computer</Data>
  <Data Name="AttributeLDAPDisplayName">ms-Mcs-AdmPwdExpirationTime</Data>
  <Data Name="ObjectClass">computer</Data>
  <Data Name="AttributeLDAPDisplayName">ms-Mcs-AdmPwdExpirationTime</Data>
</Document>

So I want to match only the tags with Name="ObjectClass".

CodePudding user response:

This returns only the Data tags whose Name attribute is "ObjectClass". It relies on two external libraries, installed with pip install bs4 lxml:

from bs4 import BeautifulSoup

xml = '''\
<Document>
  <Data Name="ObjectClass">computer</Data>
  <Data Name="AttributeLDAPDisplayName">ms-Mcs-AdmPwdExpirationTime</Data>
  <Data Name="ObjectClass">computer</Data>
  <Data Name="AttributeLDAPDisplayName">ms-Mcs-AdmPwdExpirationTime</Data>
  <Other Name="ObjectClass">other</Other>
</Document>
'''

soup = BeautifulSoup(xml, 'xml')
for data in soup.find_all('Data', Name='ObjectClass'):
    print(data.get_text())

Output:

computer
computer

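Note that case matters: the attribute is Name, not name. The lowercase name is also the first parameter of find_all() (the tag name), so passing name= as a keyword after 'Data' supplies that parameter twice, which is what raised the TypeError in the question.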
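
To see why the attempt in the question failed, here is a minimal sketch in plain Python. The find_all function below is a hypothetical stand-in, not BeautifulSoup's code; it only assumes a similar parameter layout, with a first parameter called name and extra keywords collected as attribute filters.

```python
def find_all(name=None, attrs=None, **kwargs):
    """Hypothetical stand-in with a parameter layout similar to Tag.find_all()."""
    filters = dict(attrs or {})
    filters.update(kwargs)  # extra keywords are treated as attribute filters
    return name, filters


# The positional 'Data' already fills `name`, so name=... is a second value for it.
try:
    find_all('Data', name='ObjectClass')
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# A differently spelled keyword (capital N) falls through to **kwargs instead.
ok = find_all('Data', Name='ObjectClass')
print(ok)

# An attribute literally called "name" can be passed through the attrs dict.
also_ok = find_all('Data', attrs={'name': 'ObjectClass'})
print(also_ok)
```

The last call mirrors the attrs argument that find_all() accepts in BeautifulSoup, which is the usual way to filter on an attribute whose name collides with a parameter.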

CodePudding user response:

One liner without any external library :-)

import xml.etree.ElementTree as ET


xml = '''\
<Document>
  <Data Name="ObjectClass">computer1</Data>
  <Data Name="AttributeLDAPDisplayName">ms-Mcs-AdmPwdExpirationTime</Data>
  <Data Name="ObjectClass">compute2r</Data>
  <Data Name="AttributeLDAPDisplayName">ms-Mcs-AdmPwdExpirationTime</Data>
  <Other Name="ObjectClass">other</Other>
</Document>
'''
root = ET.fromstring(xml)
object_class_data = [x.text for x in root.findall('.//Data[@Name="ObjectClass"]')]
print(object_class_data)

Output:

['computer1', 'compute2r']
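
If the same lookup is needed for other attribute values, the XPath expression can be built from parameters. This is only a sketch: texts_by_attribute is a hypothetical helper name, and it assumes the value contains no double quote characters.

```python
import xml.etree.ElementTree as ET


def texts_by_attribute(xml_text, tag, attr, value):
    """Return the text of every <tag> element whose attribute attr equals value."""
    root = ET.fromstring(xml_text)
    # Same kind of predicate as above, e.g. .//Data[@Name="ObjectClass"]
    # Assumption: value contains no double quote characters.
    path = f'.//{tag}[@{attr}="{value}"]'
    return [element.text for element in root.findall(path)]


xml = '''\
<Document>
  <Data Name="ObjectClass">computer1</Data>
  <Data Name="AttributeLDAPDisplayName">ms-Mcs-AdmPwdExpirationTime</Data>
  <Data Name="ObjectClass">compute2r</Data>
  <Data Name="AttributeLDAPDisplayName">ms-Mcs-AdmPwdExpirationTime</Data>
  <Other Name="ObjectClass">other</Other>
</Document>
'''

object_classes = texts_by_attribute(xml, 'Data', 'Name', 'ObjectClass')
ldap_names = texts_by_attribute(xml, 'Data', 'Name', 'AttributeLDAPDisplayName')
print(object_classes)
print(ldap_names)
```

The Other element is skipped in both calls because the tag name is part of the path, just as in the one-liner.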