I couldn't find any information about reading ascii tables (prettytable-looking). I need to parse some tables that look like this:
-------- -------- -------------------
| Planets |
-------- -------- -------------------
| Planet | R (km) | mass (x 10^29 kg) |
-------- -------- -------------------
| Sun | 696000 | 1989100000 |
|(Solar) | | |
-------- -------- -------------------
| Earth | 6371 | 5973.6 |
-------- -------- -------------------
| Moon | 1737 | 73.5 |
-------- -------- -------------------
| Mars | 3390 | 641.85 |
-------- -------- -------------------
As you can see, this table contains heading and subtitle, but it isn't a main problem. Here how I tried to parse it (all tries was in Python, maybe exist more suitable language):
module petl
- doesn't work, coud not read-------- -------- -------------------
np.fromtxt
or smth doesnt work too, because there's no arraymodule asciitable
doesn't work,
asciitable.read("sample.txt",delimiter='|',guess=False,numpy=False, quotechar="'")
InconsistentTableError: Number of header columns (1) inconsistent with data columns (9) at data line 0
Header values: [' ---------- ------------------------------------------------------------ --------------- ----------------- ---------------- -------------------- ----------------------- ']
and of course I tried all the combinations of parametrs
- And yes, tried simple (but not) way and used regular expressions. But in case if "row" have more than 1 line there's too hard to catch all exceptions.
- Also tried simple
split
, but it unhappy way too...
I heard about numpy substitute by 0 and 1, but its too hard with my table. Please help.
CodePudding user response:
I think you could do this using python, with a few passes you can convert this into something that suits your needs.
table_str = """
-------- -------- -------------------
| Planets |
-------- -------- -------------------
| Planet | R (km) | mass (x 10^29 kg) |
-------- -------- -------------------
| Sun | 696000 | 1989100000 |
|(Solar) | | |
-------- -------- -------------------
| Earth | 6371 | 5973.6 |
-------- -------- -------------------
| Moon | 1737 | 73.5 |
-------- -------- -------------------
| Mars | 3390 | 641.85 |
-------- -------- -------------------
"""
table_list = [
[item.strip() for item in line.split('|') if item] # maintain the number of columns in rows.
for line in table_str.strip().split('\n')
if ' -' not in line # discard -
]
column_count = len(table_list[1])
row_idx_to_remove = list()
for row_idx, row in enumerate(table_list[2:], start=2):
if any(not c for c in row): # A multiline column entry
row_idx_to_remove.append(row_idx)
for col_idx, col in enumerate(table_list[row_idx - 1]):
table_list[row_idx - 1][col_idx] = row[col_idx]
# Remove the rows that have already been merged into previous ones.
for idx in row_idx_to_remove:
del table_list[idx]
table_dict = {'Name': table_list[0][0], 'Column Names': table_list[1], 'Rows': table_list[2:]}
print(table_dict)
print(table_list)
Output:
{'Name': 'Planets', 'Column Names': ['Planet', 'R (km)', 'mass (x 10^29 kg)'], 'Rows': [['Sun(Solar)', '696000', '1989100000'], ['Earth', '6371', '5973.6'], ['Moon', '1737', '73.5'], ['Mars', '3390', '641.85']]}
[['Planets'], ['Planet', 'R (km)', 'mass (x 10^29 kg)'], ['Sun(Solar)', '696000', '1989100000'], ['Earth', '6371', '5973.6'], ['Moon', '1737', '73.5'], ['Mars', '3390', '641.85']]
This implementation does not handle empty columns in rows nor does it handle the case where every column in a row has more than one line.
You could also make it use re.split, and split on lines containing -, then you could handle any rows that would contain the edge cases of this implementation.