How do I parse numeric tables from a text file using templates in Python?

I would like to extract a series of tables from a text file that looks something like the following. Each table heading follows a regular pattern, and each table ends with a blank line. Eventually I want each table in a NumPy array, but once the lines of numeric data are isolated, converting them to an array is easy.

Contents of example.txt:

lines to ignore

Table AAA

-  ----
1  3.5
3  6.8
5  9.933


more lines to ignore
more lines to ignore

Table BBB

-  ----
2  5.0
5  6.8
99  9.933

even more lines to ignore

From this, I would like a list, something like:

[ 
   { 'id' : 'AAA', 'data' : [[1,3.5],[3,6.8],[5,9.933]]},
   { 'id' : 'BBB', 'data' : [[2,5.0],[5,6.8],[99,9.933]]},
]

I have written plenty of one-off parsers for this, but I'd like to do something with templates, based on what I've seen in the ttp Python package. Unfortunately for me, that package seems to be focused on networking configuration files, so none of the examples are close to what I want to do.

If there is a better Python package to use, I'm open to suggestions.

Here is what I've started with:

import ttp

template = """
<group name="table data" method="table">

Table {{ tab_name }}
{{ x1 | ROW }}

</group>
"""

lines = ''.join(open('example.txt').readlines())

parser = ttp.ttp(data=lines, template=template)
parser.parse()

res = parser.result()
print(res)

But this doesn't separate the tables or ignore the interspersed lines of text.

In [11]: res
Out[11]:
[[{'table data': [{'x1': 'lines to ignore'},
    {'tab_name': 'AAA'},
    {'x1': '-  ----'},
    {'x1': '1  3.5'},
    {'x1': '3  6.8'},
    {'x1': '5  9.933'},
    {'x1': 'more lines to ignore'},
    {'x1': 'more lines to ignore'},
    {'tab_name': 'BBB'},
    {'x1': '-  ----'},
    {'x1': '2  5.0'},
    {'x1': '5  6.8'},
    {'x1': '99  9.933'},
    {'x1': 'even more lines to ignore'}]}]]

CodePudding user response：

No need to find a package that does the job; you can use regular expressions for that:

import re

def isolate_tables(text: str) -> list:
    tables = []

    lines = iter(line.strip() for line in text.split("\n"))

    while True:
        try:
            match_table_name = None
            while match_table_name is None:
                # Skip lines until a "Table <name>" heading is found
                match_table_name = re.match(r"Table\s+(.+)$", next(lines))

            table_name, = match_table_name.groups()
            table_data = []

            tables.append((table_name, table_data))

            match_header = None
            while match_header is None:
                # Skip lines until the dashed header row ("-  ----") is found
                match_header = re.match(r"^[-\s]+$", next(lines))

            match_data_line = True
            while match_data_line:
                # Split each data row on whitespace; a blank line ends the table
                match_data_line = re.split(r"\s+", next(lines))
                if len(match_data_line) > 1:
                    table_data.append(match_data_line)
                else:
                    match_data_line = False
        
        except StopIteration:
            break

    return tables

with open('example.txt') as f:
    example = f.read()

isolate_tables(example)
# [('AAA', [['1', '3.5'], ['3', '6.8'], ['5', '9.933']]), ('BBB', [['2', '5.0'], ['5', '6.8'], ['99', '9.933']])]

I'll let you adapt the output to your needs.
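As a rough sketch of that adaptation, the (name, rows) pairs shown above can be turned into the list of dicts the question asks for. The helper name `to_records` is made up for illustration, and the code assumes every row has exactly two columns, an integer followed by a float, as in the example file.

```python
def to_records(tables):
    """Convert (name, rows-of-strings) pairs into dicts with numeric data."""
    records = []
    for name, rows in tables:
        # Assumption: column 1 is an integer, column 2 is a float.
        data = [[int(x), float(y)] for x, y in rows]
        records.append({'id': name, 'data': data})
    return records


# Output of isolate_tables() for the example file, copied from above.
tables = [('AAA', [['1', '3.5'], ['3', '6.8'], ['5', '9.933']]),
          ('BBB', [['2', '5.0'], ['5', '6.8'], ['99', '9.933']])]

records = to_records(tables)
print(records)
# [{'id': 'AAA', 'data': [[1, 3.5], [3, 6.8], [5, 9.933]]}, {'id': 'BBB', 'data': [[2, 5.0], [5, 6.8], [99, 9.933]]}]
```

Each `data` list can then be passed to `numpy.array` if an array is needed.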

CodePudding user response：

Hope this helps a little:

import ttp

template = """
<group name="table data" method="table">

Table {{ tab_name }}

{{D | ROW | contains('.')| split(" ") }}


</group>
"""

lines = ''.join(open('t1.txt').readlines())

parser = ttp.ttp(data=lines, template=template)
parser.parse()

res = parser.result(format='json')[0]
print(res)
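If a template still returns one flat list of rows, like the result shown in the question, the rows can be regrouped in plain Python afterwards. This is only a sketch: the function name `group_rows` is hypothetical, and it assumes the key names `tab_name` and `x1` from the question's template and two numeric columns per data row.

```python
def group_rows(flat_rows):
    """Group a flat list of row dicts into one dict per table."""
    tables = []
    current = None
    for row in flat_rows:
        if 'tab_name' in row:
            # A heading row starts a new table.
            current = {'id': row['tab_name'], 'data': []}
            tables.append(current)
        elif current is not None and 'x1' in row:
            fields = row['x1'].split()
            if len(fields) != 2:
                continue  # not a two-column row
            try:
                current['data'].append([int(fields[0]), float(fields[1])])
            except ValueError:
                continue  # dashed header or ordinary text, not numbers
    return tables


# Flat result copied from the question (inner list only).
flat_rows = [
    {'x1': 'lines to ignore'},
    {'tab_name': 'AAA'},
    {'x1': '-  ----'},
    {'x1': '1  3.5'},
    {'x1': '3  6.8'},
    {'x1': '5  9.933'},
    {'x1': 'more lines to ignore'},
    {'x1': 'more lines to ignore'},
    {'tab_name': 'BBB'},
    {'x1': '-  ----'},
    {'x1': '2  5.0'},
    {'x1': '5  6.8'},
    {'x1': '99  9.933'},
    {'x1': 'even more lines to ignore'},
]

grouped = group_rows(flat_rows)
print(grouped)
```

Non-numeric rows are dropped by the `ValueError` handler rather than by the template, so this trades template precision for a simple post-processing step.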