Converting text to a Python list

Time:07-31

I have this text:

text = """<?xml version="1.0"?><mainmodule><module1><heading>Lesson 01: Design Authorization</heading><subheading><item1>Learning Objectives</item1><item2>Choosing an Authorization Approach</item2><item3>Access Management Solution</item3></subheading></module1><module2><heading>Lesson 02: Design a Solution for Logging and Monitoring</heading><subheading><item1>Learning Objectives</item1><item2>Monitoring Tools</item2><item3>Azure Monitor Health and Availability Monitoring</item3><item4>Initiating Automated Response Using Action Groups</item4><item5>Configure and Manage Alerts</item5><item6>Demo Azure Logging and Monitoring</item6><item7>Demo Azure Alerts</item7><item8>Recap</item8></subheading></module2><module3><heading>Lesson 03: Design for High Availability</heading><subheading><item1>Learning Objectives</item1><item2>Architecture Best Practices for Reliability into Categories</item2><item3>Solution for Recovery in Different Regions</item3><item4>Solution for Azure Backup Management</item4><item5>Solution for Data Archiving and Retention</item5></subheading></module3></mainmodule>"""

I would like the output in this format:

output = [{
  'heading': 'Lesson 01: Design Authorization',
  'subheading': [{'subheading': 'Learning Objectives'},
   {'subheading': 'Choosing an Authorization Approach'},
   {'subheading': 'Access Management Solution'}]},
{
  'heading': 'Lesson 02: Design a Solution for Logging and Monitoring',
  'subheading': [{'subheading': 'Learning Objectives'},
   {'subheading': 'Monitoring Tools'},
   {'subheading': 'Azure Monitor Health and Availability Monitoring'},
   {'subheading': 'Initiating Automated Response Using Action Groups'},
   {'subheading': 'Configure and Manage Alerts'},
   {'subheading': 'Demo Azure Logging and Monitoring'},
   {'subheading': 'Demo Azure Alerts'},
   {'subheading': 'Recap'}]},
{
  'heading': 'Lesson 03: Design for High Availability',
  'subheading': [{'subheading': 'Learning Objectives'},
   {'subheading': 'Architecture Best Practices for Reliability into Categories'},
   {'subheading': 'Solution for Recovery in Different Regions'},
   {'subheading': 'Solution for Azure Backup Management'},
   {'subheading': 'Solution for Data Archiving and Retention'}]}
]

The text under each <heading> tag needs to go under "heading" in the output, and the text under each <subheading> -> <item> tag needs to go under "subheading" for the respective "heading".

I am trying to solve this by creating a list of lists. This is what I have so far, but I can't get it to work.

output = [
    {
        'heading':'',
        'subheading':[
            {
                'subheading':''
            }
        ]
    }
]

import re
heading_list = re.findall(r'<heading>.*?</heading>', text)
subheading_list = re.findall(r'<item\d*?>.*?</item\d*?>', text)

no_of_items = 0

count_item = [[]]*len(heading_list)

for num,sub in enumerate(subheading_list):

    if '<item1>' in sub and num!=0:
        no_of_items += 1

        count_item[no_of_items].append(sub)

    else:
        count_item[no_of_items].append(sub)

I want to append each item to the matching inner list of count_item, but somehow every item gets appended to every list. How can I solve this?

CodePudding user response:

Solution using BeautifulSoup (the "xml" parser requires lxml to be installed):

from bs4 import BeautifulSoup

with open("your_file.xml", "r") as f_in:
    soup = BeautifulSoup(f_in.read(), "xml")

out = []
for module in soup.find_all(lambda tag: tag.name.startswith("module")):
    out.append({"heading": module.find("heading").text, "subheading": []})
    for item in module.find_all(lambda tag: tag.name.startswith("item")):
        out[-1]["subheading"].append({"subheading": item.text})

print(out)

Prints:

[
    {
        "heading": "Lesson 01: Design Authorization",
        "subheading": [
            {"subheading": "Learning Objectives"},
            {"subheading": "Choosing an Authorization Approach"},
            {"subheading": "Access Management Solution"},
        ],
    },
    {
        "heading": "Lesson 02: Design a Solution for Logging and Monitoring",
        "subheading": [
            {"subheading": "Learning Objectives"},
            {"subheading": "Monitoring Tools"},
            {"subheading": "Azure Monitor Health and Availability Monitoring"},
            {"subheading": "Initiating Automated Response Using Action Groups"},
            {"subheading": "Configure and Manage Alerts"},
            {"subheading": "Demo Azure Logging and Monitoring"},
            {"subheading": "Demo Azure Alerts"},
            {"subheading": "Recap"},
        ],
    },
    {
        "heading": "Lesson 03: Design for High Availability",
        "subheading": [
            {"subheading": "Learning Objectives"},
            {
                "subheading": "Architecture Best Practices for Reliability into Categories"
            },
            {"subheading": "Solution for Recovery in Different Regions"},
            {"subheading": "Solution for Azure Backup Management"},
            {"subheading": "Solution for Data Archiving and Retention"},
        ],
    },
]
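
If installing a third-party package is not an option, roughly the same traversal can be sketched with the standard library's xml.etree.ElementTree. This swaps BeautifulSoup for a different parser, so it is not the method from the answer above. The shortened sample string and the function name modules_to_list are made up for illustration, and the sketch assumes every module has exactly one <heading> and one <subheading> child, as in the question.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample with the same shape as the question's XML.
sample = (
    '<?xml version="1.0"?>'
    "<mainmodule>"
    "<module1><heading>Lesson 01</heading>"
    "<subheading><item1>Objectives</item1><item2>Approach</item2></subheading>"
    "</module1>"
    "<module2><heading>Lesson 02</heading>"
    "<subheading><item1>Tools</item1></subheading>"
    "</module2>"
    "</mainmodule>"
)


def modules_to_list(xml_text):
    """Build the list of heading/subheading dicts from the XML string."""
    root = ET.fromstring(xml_text)
    result = []
    for module in root:  # direct children: module1, module2, ...
        items = module.find("subheading")  # assumed to be present
        result.append({
            "heading": module.findtext("heading"),
            # Iterating an element yields its child elements (item1, item2, ...).
            "subheading": [{"subheading": item.text} for item in items],
        })
    return result


print(modules_to_list(sample))
```

Passing the full text string from the question instead of sample should give the three-lesson structure shown above, provided the assumption about the document shape holds.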

CodePudding user response:

In your code, count_item = [[]]*len(heading_list) ends up creating len(heading_list) copies of the same empty list. This means that whatever element of count_item you append to, you'll end up appending to the same list object.
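
The aliasing is easy to see in isolation. A minimal sketch (the name rows is only for illustration):

```python
# Multiplying a list repeats references to the SAME inner list object.
rows = [[]] * 3
rows[0].append("x")

print(rows)                                  # [['x'], ['x'], ['x']]
print(all(row is rows[0] for row in rows))   # True: one object, three references
```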

Try something like count_item = [[] for _ in heading_list].
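
With that change, the grouping loop from the question could look roughly like the sketch below. It keeps the question's variable names, but runs on a shortened, made-up sample string, and it assumes that every module's first item is tagged <item1>.

```python
import re

# Shortened, hypothetical sample with the same shape as the question's XML.
sample = (
    "<mainmodule>"
    "<module1><heading>Lesson 01</heading>"
    "<subheading><item1>Objectives</item1><item2>Approach</item2></subheading>"
    "</module1>"
    "<module2><heading>Lesson 02</heading>"
    "<subheading><item1>Tools</item1></subheading>"
    "</module2>"
    "</mainmodule>"
)

heading_list = re.findall(r"<heading>.*?</heading>", sample)
subheading_list = re.findall(r"<item\d+>.*?</item\d+>", sample)

# One independent list per heading, instead of [[]] * len(heading_list).
count_item = [[] for _ in heading_list]

no_of_items = 0
for num, sub in enumerate(subheading_list):
    # A new <item1> (other than the very first one) starts the next module.
    if sub.startswith("<item1>") and num != 0:
        no_of_items += 1
    count_item[no_of_items].append(sub)

print(count_item)
```

Each inner list now holds only the items of its own module. The matches still include the surrounding tags, so they would need stripping before building the final output.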

CodePudding user response:

Have you tried the xmltodict library? It is a neat library that converts XML data to a Python dictionary.

# !pip install xmltodict
import xmltodict

text = """<?xml version="1.0"?><mainmodule><module1><heading>Lesson 01: Design Authorization</heading><subheading><item1>Learning Objectives</item1><item2>Choosing an Authorization Approach</item2><item3>Access Management Solution</item3></subheading></module1><module2><heading>Lesson 02: Design a Solution for Logging and Monitoring</heading><subheading><item1>Learning Objectives</item1><item2>Monitoring Tools</item2><item3>Azure Monitor Health and Availability Monitoring</item3><item4>Initiating Automated Response Using Action Groups</item4><item5>Configure and Manage Alerts</item5><item6>Demo Azure Logging and Monitoring</item6><item7>Demo Azure Alerts</item7><item8>Recap</item8></subheading></module2><module3><heading>Lesson 03: Design for High Availability</heading><subheading><item1>Learning Objectives</item1><item2>Architecture Best Practices for Reliability into Categories</item2><item3>Solution for Recovery in Different Regions</item3><item4>Solution for Azure Backup Management</item4><item5>Solution for Data Archiving and Retention</item5></subheading></module3></mainmodule>"""
d = xmltodict.parse(text)

After that you can process the data:

output = []
# Each value under 'mainmodule' is one module (module1, module2, ...).
for module in d['mainmodule'].values():
    output.append({
        'heading': module['heading'],
        # module['subheading'] maps item1, item2, ... to their text.
        'subheading': [{'subheading': item_text}
                       for item_text in module['subheading'].values()],
    })
    
print(output)

Output:

[{'heading': 'Lesson 01: Design Authorization',
  'subheading': [{'subheading': 'Learning Objectives'},
   {'subheading': 'Choosing an Authorization Approach'},
   {'subheading': 'Access Management Solution'}]},
 {'heading': 'Lesson 02: Design a Solution for Logging and Monitoring',
  'subheading': [{'subheading': 'Learning Objectives'},
   {'subheading': 'Monitoring Tools'},
   {'subheading': 'Azure Monitor Health and Availability Monitoring'},
   {'subheading': 'Initiating Automated Response Using Action Groups'},
   {'subheading': 'Configure and Manage Alerts'},
   {'subheading': 'Demo Azure Logging and Monitoring'},
   {'subheading': 'Demo Azure Alerts'},
   {'subheading': 'Recap'}]},
 {'heading': 'Lesson 03: Design for High Availability',
  'subheading': [{'subheading': 'Learning Objectives'},
   {'subheading': 'Architecture Best Practices for Reliability into Categories'},
   {'subheading': 'Solution for Recovery in Different Regions'},
   {'subheading': 'Solution for Azure Backup Management'},
   {'subheading': 'Solution for Data Archiving and Retention'}]}]