I would like to extract a series of tables from a text file. The file looks something like the following. The table heading follows a regular pattern, and there is a blank line at the end of each table. Eventually I want each table in a NumPy array, but if I can get the lines of numeric data isolated, then converting to an array is easy.
Contents of example.txt:
lines to ignore
Table AAA
- ----
1 3.5
3 6.8
5 9.933
more lines to ignore
more lines to ignore
Table BBB
- ----
2 5.0
5 6.8
99 9.933
even more lines to ignore
From this, I would like a list, something like:
[
{ 'id' : 'AAA', 'data' : [[1,3.5],[3,6.8],[5,9.933]]},
{ 'id' : 'BBB', 'data' : [[2,5.0],[5,6.8],[99,9.933]]},
]
I have written plenty of one-off parsers for this, but I'd like to do something with templates based on what I've seen in the ttp Python package. Unfortunately for me, that package seems to be focused on networking configuration files, so none of the examples are that close to what I want to do.
If there is a better Python package to use, I'm open to suggestions.
Here is what I've started with:
import ttp
template = """
<group name="table data" method="table">
Table {{ tab_name }}
{{ x1 | ROW }}
</group>
"""
with open('example.txt') as f:
    lines = f.read()
parser = ttp.ttp(data=lines, template=template)
parser.parse()
res = parser.result()
print(res)
But this doesn't separate the tables or ignore the interspersed lines of text.
In [11]: res
Out[11]:
[[{'table data': [{'x1': 'lines to ignore'},
                  {'tab_name': 'AAA'},
                  {'x1': '- ----'},
                  {'x1': '1 3.5'},
                  {'x1': '3 6.8'},
                  {'x1': '5 9.933'},
                  {'x1': 'more lines to ignore'},
                  {'x1': 'more lines to ignore'},
                  {'tab_name': 'BBB'},
                  {'x1': '- ----'},
                  {'x1': '2 5.0'},
                  {'x1': '5 6.8'},
                  {'x1': '99 9.933'},
                  {'x1': 'even more lines to ignore'}]}]]
CodePudding user response:
You don't need a dedicated package for this; regular expressions are enough:
import re

def isolate_tables(text: str) -> list:
    tables = []
    lines = iter(line.strip() for line in text.split("\n"))
    while True:
        try:
            # Skip lines until a "Table <name>" heading is found.
            match_table_name = None
            while match_table_name is None:
                match_table_name = re.match(r"Table\s+(.+)$", next(lines))
            table_name, = match_table_name.groups()
            table_data = []
            tables.append((table_name, table_data))
            # Skip lines until the dashed separator under the heading.
            match_header = None
            while match_header is None:
                match_header = re.match(r"^[-\s]+$", next(lines))
            # Collect rows until a line that does not split into
            # at least two fields (e.g. the blank line ending the table).
            match_data_line = True
            while match_data_line:
                match_data_line = re.split(r"\s+", next(lines))
                if len(match_data_line) > 1:
                    table_data.append(match_data_line)
                else:
                    match_data_line = False
        except StopIteration:
            # End of input: return whatever has been collected.
            break
    return tables

with open("example.txt") as f:
    example = f.read()
isolate_tables(example)
# With a blank line after each table, as described in the question:
# [('AAA', [['1', '3.5'], ['3', '6.8'], ['5', '9.933']]), ('BBB', [['2', '5.0'], ['5', '6.8'], ['99', '9.933']])]
I'll let you adapt the output to your needs.
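As one possible adaptation, the (name, rows) tuples can be reshaped into the list of dicts requested in the question. This is only a sketch: the helper name `to_records` is made up for illustration, and it assumes every table has exactly two columns, an integer followed by a float, as in the sample data.

```python
def to_records(tables):
    """Convert (name, rows) tuples of strings into dicts with numeric rows."""
    records = []
    for name, rows in tables:
        # Assumption: column 0 is an integer and column 1 is a float.
        data = [[int(row[0]), float(row[1])] for row in rows]
        records.append({"id": name, "data": data})
    return records


# Same shape as the output of isolate_tables() shown above.
tables = [
    ("AAA", [["1", "3.5"], ["3", "6.8"], ["5", "9.933"]]),
    ("BBB", [["2", "5.0"], ["5", "6.8"], ["99", "9.933"]]),
]
records = to_records(tables)
print(records)
```

Each `data` list can then be passed to `numpy.array(...)` if an array is needed; note that a NumPy array holds a single dtype, so the integer column would become floats.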
CodePudding user response:
I hope this helps a little:
import ttp

template = """
<group name="table data" method="table">
Table {{ tab_name }}
{{D | ROW | contains('.')| split(" ") }}
</group>
"""
with open('t1.txt') as f:
    lines = f.read()
parser = ttp.ttp(data=lines, template=template)
parser.parse()
res = parser.result(format='json')[0]
print(res)
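If the parser still returns one flat list of records, like the output shown in the question, the tables can be separated in a plain Python pass afterwards. The sketch below is not part of ttp: `group_flat_records` is a hypothetical helper, and it assumes the record shape from the question, where a heading record carries `tab_name` and every other record carries the raw line under `x1`. Rows are kept only when every field parses as a number, and a table is closed at the first non-numeric line after its data.

```python
def group_flat_records(records):
    """Group flat {'tab_name': ...} / {'x1': ...} records into tables."""
    tables = []
    current = None  # table currently being filled, if any
    for rec in records:
        if "tab_name" in rec:
            current = {"id": rec["tab_name"], "data": []}
            tables.append(current)
            continue
        if current is None:
            continue  # text before the first table, or between tables
        fields = rec.get("x1", "").split()
        try:
            row = [float(field) for field in fields]
        except ValueError:
            row = []  # not a numeric row (separator or free text)
        if row:
            current["data"].append(row)
        elif current["data"]:
            current = None  # first non-numeric line after the rows ends the table
    return tables


# Flat records copied from the output shown in the question.
flat = [
    {"x1": "lines to ignore"},
    {"tab_name": "AAA"},
    {"x1": "- ----"},
    {"x1": "1 3.5"},
    {"x1": "3 6.8"},
    {"x1": "5 9.933"},
    {"x1": "more lines to ignore"},
    {"x1": "more lines to ignore"},
    {"tab_name": "BBB"},
    {"x1": "- ----"},
    {"x1": "2 5.0"},
    {"x1": "5 6.8"},
    {"x1": "99 9.933"},
    {"x1": "even more lines to ignore"},
]
grouped = group_flat_records(flat)
print(grouped)
```

All values come back as floats here, which suits a later conversion to a NumPy array; if the first column must stay an integer, convert it separately.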