I have this text-
text = """<?xml version="1.0"?><mainmodule><module1><heading>Lesson 01: Design Authorization</heading><subheading><item1>Learning Objectives</item1><item2>Choosing an Authorization Approach</item2><item3>Access Management Solution</item3></subheading></module1><module2><heading>Lesson 02: Design a Solution for Logging and Monitoring</heading><subheading><item1>Learning Objectives</item1><item2>Monitoring Tools</item2><item3>Azure Monitor Health and Availability Monitoring</item3><item4>Initiating Automated Response Using Action Groups</item4><item5>Configure and Manage Alerts</item5><item6>Demo Azure Logging and Monitoring</item6><item7>Demo Azure Alerts</item7><item8>Recap</item8></subheading></module2><module3><heading>Lesson 03: Design for High Availability</heading><subheading><item1>Learning Objectives</item1><item2>Architecture Best Practices for Reliability into Categories</item2><item3>Solution for Recovery in Different Regions</item3><item4>Solution for Azure Backup Management</item4><item5>Solution for Data Archiving and Retention</item5></subheading></module3></mainmodule>"""
I would like the output in this format-
output = [{
'heading': 'Lesson 01: Design Authorization',
'subheading': [{'subheading': 'Learning Objectives'},
{'subheading': 'Choosing an Authorization Approach'},
{'subheading': 'Access Management Solution'}]},
{
'heading': 'Lesson 02: Design a Solution for Logging and Monitoring',
'subheading': [{'subheading': 'Learning Objectives'},
{'subheading': 'Monitoring Tools'},
{'subheading': 'Azure Monitor Health and Availability Monitoring'},
{'subheading': 'Initiating Automated Response Using Action Groups'},
{'subheading': 'Configure and Manage Alerts'},
{'subheading': 'Demo Azure Logging and Monitoring'},
{'subheading': 'Demo Azure Alerts'},
{'subheading': 'Recap'}]},
{
'heading': 'Lesson 03: Design for High Availability',
'subheading': [{'subheading': 'Learning Objectives'},
{'subheading': 'Architecture Best Practices for Reliability into Categories'},
{'subheading': 'Solution for Recovery in Different Regions'},
{'subheading': 'Solution for Azure Backup Management'},
{'subheading': 'Solution for Data Archiving and Retention'}]}
]
The text under tag needs to be under "heading" in output. And The text under "subheading" -> "item" needs to be under "subheading" under respective "heading".
I am trying to solve this by creating a list of lists . Till now I have done this, but I am unable to solve.
output = [
{
'heading':'',
'subheading':[
{
'subheading':''
}
]
}
]
import re
heading_list = re.findall('<heading>. ?</heading>',text)
subheading_list = re.findall('<item\d ?>. ?</item\d ?>',text)
no_of_items = 0
count_item = [[]]*len(heading_list)
for num,sub in enumerate(subheading_list):
if '<item1>' in sub and num!=0:
no_of_items =1
count_item[no_of_items].append(sub)
else:
count_item[no_of_items].append(sub)
I want to append the items in count_item's list of lists, but somehow every items are getting appended in every list. How can I solve this?
CodePudding user response:
Solution using beautifulsoup
:
from bs4 import BeautifulSoup
with open("your_file.xml", "r") as f_in:
soup = BeautifulSoup(f_in.read(), "xml")
out = []
for module in soup.find_all(lambda tag: tag.name.startswith("module")):
out.append({"heading": module.find("heading").text, "subheading": []})
for item in module.find_all(lambda tag: tag.name.startswith("item")):
out[-1]["subheading"].append({"subheading": item.text})
print(out)
Prints:
[
{
"heading": "Lesson 01: Design Authorization",
"subheading": [
{"subheading": "Learning Objectives"},
{"subheading": "Choosing an Authorization Approach"},
{"subheading": "Access Management Solution"},
],
},
{
"heading": "Lesson 02: Design a Solution for Logging and Monitoring",
"subheading": [
{"subheading": "Learning Objectives"},
{"subheading": "Monitoring Tools"},
{"subheading": "Azure Monitor Health and Availability Monitoring"},
{"subheading": "Initiating Automated Response Using Action Groups"},
{"subheading": "Configure and Manage Alerts"},
{"subheading": "Demo Azure Logging and Monitoring"},
{"subheading": "Demo Azure Alerts"},
{"subheading": "Recap"},
],
},
{
"heading": "Lesson 03: Design for High Availability",
"subheading": [
{"subheading": "Learning Objectives"},
{
"subheading": "Architecture Best Practices for Reliability into Categories"
},
{"subheading": "Solution for Recovery in Different Regions"},
{"subheading": "Solution for Azure Backup Management"},
{"subheading": "Solution for Data Archiving and Retention"},
],
},
]
CodePudding user response:
In your code, count_item = [[]]*len(heading_list)
ends up creating len(heading_list)
copies of the same empty list. This means that whatever element of count_item
you append to, you'll end up appending to the same list object.
Try something like count_item = [[] for _ in heading_list]
.
CodePudding user response:
Have you tried the library xmltodict
? This is a neat library that converts the extensible markup language data to a python dictionary..
# !pip install xmltodict
import xmltodict
text = """<?xml version="1.0"?><mainmodule><module1><heading>Lesson 01: Design Authorization</heading><subheading><item1>Learning Objectives</item1><item2>Choosing an Authorization Approach</item2><item3>Access Management Solution</item3></subheading></module1><module2><heading>Lesson 02: Design a Solution for Logging and Monitoring</heading><subheading><item1>Learning Objectives</item1><item2>Monitoring Tools</item2><item3>Azure Monitor Health and Availability Monitoring</item3><item4>Initiating Automated Response Using Action Groups</item4><item5>Configure and Manage Alerts</item5><item6>Demo Azure Logging and Monitoring</item6><item7>Demo Azure Alerts</item7><item8>Recap</item8></subheading></module2><module3><heading>Lesson 03: Design for High Availability</heading><subheading><item1>Learning Objectives</item1><item2>Architecture Best Practices for Reliability into Categories</item2><item3>Solution for Recovery in Different Regions</item3><item4>Solution for Azure Backup Management</item4><item5>Solution for Data Archiving and Retention</item5></subheading></module3></mainmodule>"""
d = xmltodict.parse(text)
After that you can process the data
output = []
for module in d['mainmodule'].keys():
dic = {}
dic['heading'] = d['mainmodule'][module]['heading']
lis = []
for item in d['mainmodule'][module]['subheading']:
lis.append({'subheading': d['mainmodule'][module]['subheading'][item]})
dic['subheading'] = lis
output.append(dic)
print(output)
Output:
[{'heading': 'Lesson 01: Design Authorization',
'subheading': [{'subheading': 'Learning Objectives'},
{'subheading': 'Choosing an Authorization Approach'},
{'subheading': 'Access Management Solution'}]},
{'heading': 'Lesson 02: Design a Solution for Logging and Monitoring',
'subheading': [{'subheading': 'Learning Objectives'},
{'subheading': 'Monitoring Tools'},
{'subheading': 'Azure Monitor Health and Availability Monitoring'},
{'subheading': 'Initiating Automated Response Using Action Groups'},
{'subheading': 'Configure and Manage Alerts'},
{'subheading': 'Demo Azure Logging and Monitoring'},
{'subheading': 'Demo Azure Alerts'},
{'subheading': 'Recap'}]},
{'heading': 'Lesson 03: Design for High Availability',
'subheading': [{'subheading': 'Learning Objectives'},
{'subheading': 'Architecture Best Practices for Reliability into Categories'},
{'subheading': 'Solution for Recovery in Different Regions'},
{'subheading': 'Solution for Azure Backup Management'},
{'subheading': 'Solution for Data Archiving and Retention'}]}]