Parse ascii table header

I need to parse this into a DataFrame or a list:

tmp = [
    '+--------------+-----------------------------------------+',
    '| Something to |        Some header with subheader       |',
    '|  watch or    +-----------------+-----------------------+',
    '|     idk      |      First      |   another text again  |',
    '|              |                 |  with one more line   |',
    '|              |                 +-----------------------+',
    '|              |                 |  and this | how it be |',
    '+--------------+-----------------+-----------------------+',
]

It is just a plain-text table with an unusual, nested header. I need to transform it into this:

['Something to watch or idk', 'Some header with subheader First', 'Some header with subheader another text again with one more line and this', 'Some header with subheader another text again with one more line how it be']

Here's my first attempt, which gets me closer to a solution:

import re

# indices of the border lines that start with '+'
pluses = [i for i, element in enumerate(tmp) if element[0] == '+']
tmp2 = tmp[pluses[0]:pluses[1]+1].copy()
table_str = ''.join(tmp[pluses[0]:pluses[1]+1])
# positions of corners ('+') and vertical separators ('|') on each line
col = [[i for i, symbol in enumerate(line) if symbol == '+' or symbol == '|'] for line in tmp2]

tmp3 = []
strt = ''.join(tmp2.copy())
# split the joined text on horizontal borders such as '+-----+-----+'
table_list = [l.strip().replace('\n', '') for l in re.split(r'\+[+-]+', strt) if l.strip()]
for row in table_list:
    joined_row = ['' for _ in range(len(row))]
    for lines in [line for line in row.split('||')]:
        line_part = [i.strip() for i in lines.split('|') if i]
        joined_row = [i + j for i, j in zip(joined_row, line_part)]
        tmp3.append(joined_row)

Here's the output:

tmp3
out[4]:
[['Something to', 'Some header with subheader'],
 ['Something towatch or'],
 ['idk', 'First', 'another text again'],
 ['idk', 'First', 'another text againwith one more line'],
 ['idk'],
 ['', '', 'and this', 'how it be']]

All that remains is to join these pieces in the right way, but I don't know how.

Addendum: we can locate the pluses and separators like this:

col = [[i for i, symbol in enumerate(line) if symbol == '+' or symbol == '|'] for line in tmp2]
[[0, 15, 57],
 [0, 15, 57],
 [0, 15, 33, 57],
 [0, 15, 33, 57],
 [0, 15, 33, 57],
 [0, 15, 33, 57],
 [0, 15, 33, 45, 57],
 [0, 15, 33, 57]]

Then we could split or group by cell, but I don't know how to do that either. Please help.

Example No.2:

+----------+------------------------------------------------------------+---------------+----------------------------------+--------------------+-----------------------+
|   Number |       longtextveryveryloooooong                            |  aaaaaaaaaaa  |         bbbbbbbbbbbbbbbbbb       |    dfsdfgsdfddd    |qqqqqqqqqqqqqqqqqqqqqq |
| string   |                                                            |               |        ccccccccccccccccccccc     |    affasdd  as     |qqqqqqqqqqqqqqqqqqqqqq |
|          |                                                            |               | eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee,|    seeerrrr   e,   |   dfsdfffffffffffff   |
|          |                                                            |               | anothertext and something        |       percent      |   ttttttttttttttttt   |
|          |                                                            |               |           (nothingtodo),         |                    | sssssssssssssssssssss |
|          |                                                            |               |             and text             |                    |zzzzzzzzzzzzzzzzzzzzzz |
|          |                                                            |               +----------------------------------+                    | b rererereerr ppppppp |
|          |                                                            |               |     all         | longtext wit-  |                    |                       |
|          |                                                            |               |                 |h many character|                    |                       |
+----------+------------------------------------------------------------+---------------+-----------------+----------------+--------------------+-----------------------+

CodePudding user response：

As the input is in the format of a reStructuredText table, you could use the docutils table parser.

import docutils.parsers.rst.tableparser
import docutils.statemachine
from collections.abc import Iterable

def extract_texts(tds):
    """Recursively extract the StringLists and join their lines."""
    texts = []
    for e in tds:
        if isinstance(e, docutils.statemachine.StringList):
            # cell contents: join the non-empty lines into one string
            texts.append(' '.join([s.strip() for s in list(e) if s]))
            break
        if isinstance(e, Iterable):
            texts.append(extract_texts(e))
    return texts

>>> parser = docutils.parsers.rst.tableparser.GridTableParser()
>>> tds = parser.parse(docutils.statemachine.StringList(tmp))
>>> extract_texts(tds)

[[],
 [],
 [[['Something to watch or idk'], ['Some header with subheader']],
  [['First'], ['another text again with one more line']],
  [['and this | how it be']]]]

Then flatten the result.
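One possible way to do the flattening is sketched below. The helper name `flatten` is hypothetical (it is not part of docutils), and the sample `nested` value is simply copied from the output shown above.

```python
def flatten(nested):
    """Yield every string in an arbitrarily nested list, depth first."""
    for item in nested:
        if isinstance(item, str):
            yield item
        else:
            # item is a (possibly empty) list: descend into it
            yield from flatten(item)


# nested list as returned by extract_texts(tds) above
nested = [[],
          [],
          [[['Something to watch or idk'], ['Some header with subheader']],
           [['First'], ['another text again with one more line']],
           [['and this | how it be']]]]

flat = list(flatten(nested))
print(flat)
```

Note that this only yields the cell texts in reading order; it does not prepend parent headers to sub-headers.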

For more general use, it is worth taking a look at tds (the structure returned by parse); some documentation exists for it.
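As a rough illustration of what that structure allows: parse returns (colspecs, headrows, bodyrows), and each cell in a row is either None (covered by another cell's span) or (morerows, morecols, offset, lines). The sketch below uses the spans to build one joined header per column. The function name `column_headers` is hypothetical, and `sample_rows` is hand-written to mirror the shape of the body rows for the first table (the offsets are placeholders), so it is not actual parser output.

```python
def column_headers(rows):
    """Join cell texts top to bottom for each column, honoring spans."""
    n_rows, n_cols = len(rows), len(rows[0])
    grid = [[None] * n_cols for _ in range(n_rows)]
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if cell is None:
                continue  # position covered by another cell's span
            morerows, morecols, _offset, lines = cell
            text = ' '.join(s.strip() for s in lines if s.strip())
            # copy the cell into every grid position it spans,
            # tagged with its origin so repeats can be detected
            for rr in range(r, r + morerows + 1):
                for cc in range(c, c + morecols + 1):
                    grid[rr][cc] = ((r, c), text)
    headers = []
    for c in range(n_cols):
        parts, last_origin = [], None
        for r in range(n_rows):
            entry = grid[r][c]
            if entry is None:
                continue
            origin, text = entry
            # add each spanning cell only once per column
            if origin != last_origin and text:
                parts.append(text)
            last_origin = origin
        headers.append(' '.join(parts))
    return headers


# hand-written sample in the shape of the body rows (assumption, see above)
sample_rows = [
    [(2, 0, 1, ['Something to', ' watch or', '    idk', '', '', '']),
     (0, 1, 1, ['Some header with subheader']),
     None],
    [None,
     (1, 0, 3, ['First', '', '', '']),
     (0, 0, 3, ['another text again', 'with one more line'])],
    [None, None, (0, 0, 6, ['and this | how it be'])],
]

headers = column_headers(sample_rows)
print(headers)
```

Because the bottom border has no '+' between "and this" and "how it be", a grid-table parser sees them as a single cell, so this approach yields three headers rather than the four asked for in the question.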

CodePudding user response：

You could do it recursively - parsing each "sub table" at a time:

def parse_table(table, header='', table_len=None):
    # store length of original table
    if not table_len:
        table_len = len(table)

    # first corner after the left edge marks the end of the first column
    col = table[0].find('+', 1)
    rows = [
        row for row in range(1, len(table))
            if table[row].startswith('+')
            and table[row][col] == '+'
    ]
    row = rows[0]

    # split lines into "columns"
    columns = (line[1:col].split('|') for line in table[1:row])

    # rebuild each column appending to header
    content = [
        ' '.join([header] + [line.strip() for line in lines]).strip()
        for lines in zip(*columns)
    ]

    # parse table below
    if row + 2 < len(table):
        header = content[-1]
        # if we are not the last table - we are a header
        if len(rows) > 1:
            header = content.pop()
        next_table = [line[:col + 1] for line in table[row:]]
        #print('\n'.join(next_table))
        content.extend(parse_table(next_table, header=header, table_len=table_len))

    # parse table to the right
    if col + 2 < len(table[0]):
        # reset the header if we are the "top" table in new column
        if len(table) == table_len:
            header = ''
        next_table = [line[col:] for line in table]
        #print('\n'.join(next_table))
        content.extend(
            parse_table(next_table, header=header, table_len=table_len)
        )

    return content

Output:

>>> parse_table(table)
['Something to watch or idk',
 'Some header with subheader First',
 'Some header with subheader another text again with one more line and this',
 'Some header with subheader another text again with one more line how it be']
>>> parse_table(big_table)
['Number string',
 'longtextveryveryloooooong',
 'aaaaaaaaaaa',
 'bbbbbbbbbbbbbbbbbb ccccccccccccccccccccc eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee, anothertext and something (nothingtodo), and text all',
 'bbbbbbbbbbbbbbbbbb ccccccccccccccccccccc eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee, anothertext and something (nothingtodo), and text longtext wit- h many character',
 'dfsdfgsdfddd affasdd  as seeerrrr   e, percent',
 'qqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqq dfsdfffffffffffff ttttttttttttttttt sssssssssssssssssssss zzzzzzzzzzzzzzzzzzzzzz b rererereerr ppppppp']