How to parse list of messy strings in python

Time:12-02

I'm trying to extract some IDs and their statuses from an XML file. I've reached a point where I have a list of strings containing this information, and I just need to extract the codes and the status (pass or fail). The problem is that the strings are extremely messy, and I'm new to Python, so I'm not sure how to do this. The relevant piece of code:

l = len(res_smaller)
print(res_smaller)
for field in range(1, l, 2):
    aux = res_smaller[field]
    print(aux)

Output:

<big class="Heading3">1.2 <a name="i__786909744_34">Test Case CODE_2096: RANDOMTEXT</a>: Failed</big>
<big class="Heading3">1.3 <a name="i__786909744_1424">Test Case CODE_2101: RANDOMTEXT</a>: Failed</big>
<big class="Heading3">1.4 <a name="i__786909744_2814">Test Case CODE_2111: RANDOMTEXT</a>: Failed</big>   
<big class="Heading3">1.5 <a name="i__786909744_2850">Test Case CODE_2098: RANDOMTEXT</a>: Failed</big>

I used the BeautifulSoup library to find_all Heading3 classes, parsed a bit more, and now I have a list from which I printed the lines that are of interest to me (this is why I start at index 1 and step by 2). My idea is to create a dictionary of the form CODE_NUMBER: STATUS, but I don't know how to extract these values from each field. I thought of using aux.split(" ") to split each field on whitespace and take the 5th and 7th elements, but this gives me an error, so I'm not sure whether this is possible in Python. Any ideas?

EDIT: Here's the code with aux.split. I've also added the whole list as printed:

l = len(res_smaller)
print(res_smaller)

for field in range(1, l, 2):
    aux = res_smaller[field]
    print(aux.split(" "))  

Output:

[<big class="Heading3">1.1 <a name="i__786909744_13">RANDOMTEXT</a>: Passed</big>, <big class="Heading3">1.2 <a name="i__786909744_34">Test Case CODE_2096: RANDOMTEXT</a>: Failed</big>, <big class="Heading3">Main Part of Test Case</big>, <big class="Heading3">1.3 <a name="i__786909744_1424">Test Case CODE_2101: RANDOMTEXT</a>: Failed</big>, <big class="Heading3">Main Part of Test Case</big>, <big class="Heading3">1.4 <a name="i__786909744_2814">Test Case CODE_2111: RANDOMTEXT</a>: Failed</big>, 
<big class="Heading3">Main Part of Test Case</big>, <big class="Heading3">1.5 <a name="i__786909744_2850">Test Case CODE_2098: RANDOMTEXT</a>: Failed</big>, <big class="Heading3">Main Part of Test Case</big>]
Traceback (most recent call last):
  File "D:\Code\Python\Projects\HTML_parser.py", line 43, in <module>
    print(aux.split(" "))
TypeError: 'NoneType' object is not callable

CodePudding user response:

I'd strongly suggest using findall from the re module. Since the original input is not included, I am working with the output shown in the question:

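import re

my_dict = {}
# Odd indices hold the test-case headings; the other entries are skipped.
for field in range(1, len(res_smaller), 2):
    # The items appear to be BeautifulSoup Tag objects (the list prints without
    # quotes), so convert each one to str before passing it to re.
    aux = str(res_smaller[field])
    status = re.findall(r'</a>: (.*?)</big>', aux, re.DOTALL)
    code = re.findall(r'Case (.*?):', aux, re.DOTALL)
    # Skip entries that do not match the expected heading layout.
    if code and status:
        my_dict[code[0]] = status[0]
print(my_dict)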

Output:

{'CODE_2096': 'Failed', 'CODE_2101': 'Failed', 'CODE_2111': 'Failed', 'CODE_2098': 'Failed'}
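
A note on the traceback in the question: the list items are most likely BeautifulSoup Tag objects rather than plain strings. On a Tag, an unknown attribute such as aux.split is treated as a search for a child tag named "split", which returns None, and calling None raises "TypeError: 'NoneType' object is not callable". Converting each item with str() first avoids this. The following is a minimal, self-contained sketch of the same idea using a single compiled pattern instead of two findall calls. The rows list is hypothetical sample data copied from the printed output in the question, and the names PATTERN and extract_results are made up for illustration.

```python
import re

# Hypothetical sample data: the string form of some tags shown in the question.
rows = [
    '<big class="Heading3">1.1 <a name="i__786909744_13">RANDOMTEXT</a>: Passed</big>',
    '<big class="Heading3">1.2 <a name="i__786909744_34">Test Case CODE_2096: RANDOMTEXT</a>: Failed</big>',
    '<big class="Heading3">Main Part of Test Case</big>',
    '<big class="Heading3">1.3 <a name="i__786909744_1424">Test Case CODE_2101: RANDOMTEXT</a>: Failed</big>',
]

# Group 1 is the code after "Test Case ", group 2 is the status after "</a>:".
PATTERN = re.compile(r'Test Case (\w+):.*?</a>:\s*(\w+)')


def extract_results(items):
    """Map each test-case code to its status, ignoring non-matching items."""
    results = {}
    for item in items:
        # str() makes this work for Tag objects as well as plain strings.
        match = PATTERN.search(str(item))
        if match:
            code, status = match.groups()
            results[code] = status
    return results


print(extract_results(rows))
```

Because items that do not match are skipped, this version does not depend on the headings sitting at odd indices, so the range(1, l, 2) stepping is no longer needed.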