How to make a list of lists from a flat list given specific elements in Python

There are ways to "unflatten" a list in Python (see, for example, HERE). But how can this be done efficiently when the split points are given by specific elements? Here is a slightly altered beginning of Jane Austen's "Pride and Prejudice":

Austen = """ONE: It is a truth universally acknowledged, ONE: that a single man in possession
of a good fortune, must be in want of a wife.

TWO: However little known the feelings or views of such a man may be on his
first entering a neighbourhood, ONE: this truth is so well fixed in the minds
of the surrounding families, THREE: that he is considered as the rightful
property of some one or other of their daughters.

TWO: "My dear Mr. Bennet," said his lady to him one day, ONE: "have you heard that
Netherfield Park is let at last?"
"""

Note that some marker tokens have been added to the text:

BREAK_POINTS = ('ONE:', 'TWO:', 'THREE:')

Using nltk's RegexpTokenizer, it is quite easy to get the list of word tokens:

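from nltk.tokenize import RegexpTokenizer

# gaps=True: the pattern matches the separators (runs of whitespace), not the tokens
tokenizer = RegexpTokenizer(r'\s+', gaps=True)

tokens = []
for line in Austen.splitlines():
    if line == '':
        continue
    tokens += tokenizer.tokenize(line)  # extend the flat token list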

tokens
['ONE:', 'It', 'is', 'a', 'truth', 'universally',
'acknowledged,', 'ONE:', 'that', 'a', 'single', 'man', 'in',
'possession', 'of', 'a', 'good', 'fortune,', 'must', 'be', 'in',
'want', 'of', 'a', 'wife.', 'TWO:', 'However', 'little', 'known',
'the', 'feelings', 'or', 'views', 'of', 'such', 'a', 'man', 'may',
'be', 'on', 'his', 'first', 'entering', 'a', 'neighbourhood,',
'ONE:', 'this', 'truth', 'is', 'so', 'well', 'fixed', 'in', 'the',
'minds', 'of', 'the', 'surrounding', 'families,', 'THREE:', 'that',
'he', 'is', 'considered', 'as', 'the', 'rightful', 'property', 'of',
'some', 'one', 'or', 'other', 'of', 'their', 'daughters.', 'TWO:',
'"My', 'dear', 'Mr.', 'Bennet,"', 'said', 'his', 'lady', 'to',
'him', 'one', 'day,', 'ONE:', '"have', 'you', 'heard', 'that',
'Netherfield', 'Park', 'is', 'let', 'at', 'last?"']
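
As an aside, for plain whitespace tokenization like this, Python's built-in str.split() should give the same tokens without nltk. A minimal sketch, using a shortened sample text (an assumption made for brevity, not the full passage above):

```python
# Shortened sample text, for illustration only (not the full passage).
sample = """ONE: It is a truth universally acknowledged, ONE: that a single man

TWO: However little known the feelings
"""

# With no arguments, str.split() splits on any run of whitespace
# (spaces and newlines alike) and never yields empty strings,
# so blank lines need no special handling.
tokens = sample.split()
print(tokens)
```

Punctuation stays attached to the words ('acknowledged,'), just as with the whitespace-based RegexpTokenizer.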

How can I "unflatten" that list using BREAK_POINTS? In particular, if a break point repeats before a different one occurs, like the first two 'ONE:', the repeat should be ignored. The expected result:

[['ONE:', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged,',
'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good',
'fortune,', 'must', 'be', 'in', 'want', 'of', 'a', 'wife.'],
['TWO:', 'However', 'little', 'known', 'the', 'feelings', 'or',
'views', 'of', 'such', 'a', 'man', 'may', 'be', 'on', 'his',
'first', 'entering', 'a', 'neighbourhood,'], ['ONE:', 'this',
'truth', 'is', 'so', 'well', 'fixed', 'in', 'the', 'minds', 'of',
'the', 'surrounding', 'families,'], ['THREE:', 'that', 'he', 'is',
'considered', 'as', 'the', 'rightful', 'property', 'of', 'some',
'one', 'or', 'other', 'of', 'their', 'daughters.'], ['TWO:', '"My',
'dear', 'Mr.', 'Bennet,"', 'said', 'his', 'lady', 'to', 'him',
'one', 'day,'], ['ONE:', '"have', 'you', 'heard', 'that',
'Netherfield', 'Park', 'is', 'let', 'at', 'last?"']]

CodePudding user response:

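I'd first build a list of all break-point positions, pairing each break point with its index in the token list. Then use itertools.groupby to collapse consecutive occurrences of the same break point, keeping only the index at which each "new" sublist starts. Finally, build the new list by iterating over that index list with itertools.zip_longest, which pairs each start index with the next one (and the last one with None, so the final slice runs to the end of the list).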

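import itertools

# (index, token) for every token that is a break point
points = [(i, x) for i, x in enumerate(tokens) if x in BREAK_POINTS]

# collapse runs of the same break point, keeping the index of the first one in each run
points_cleaned = [list(group)[0][0] for _, group in itertools.groupby(points, key=lambda tup: tup[1])]
print(points_cleaned)
# [0, 25, 45, 59, 76, 88]

# slice from each start index to the next; the last end is None, i.e. up to the end of the list
result = [tokens[start:end] for start, end in itertools.zip_longest(points_cleaned, points_cleaned[1:])]
print(result)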

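Output (shown here for tokens with the punctuation stripped, which is why the markers appear as 'ONE', 'TWO' and 'THREE'):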

[
    ['ONE', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged', 'ONE', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife'], 
    ['TWO', 'However', 'little', 'known', 'the', 'feelings', 'or', 'views', 'of', 'such', 'a', 'man', 'may', 'be', 'on', 'his', 'first', 'entering', 'a', 'neighbourhood'], 
    ['ONE', 'this', 'truth', 'is', 'so', 'well', 'fixed', 'in', 'the', 'minds', 'of', 'the', 'surrounding', 'families'], 
    ['THREE', 'that', 'he', 'is', 'considered', 'as', 'the', 'rightful', 'property', 'of', 'some', 'one', 'or', 'other', 'of', 'their', 'daughters'], 
    ['TWO', 'My', 'dear', 'Mr', 'Bennet', 'said', 'his', 'lady', 'to', 'him', 'one', 'day'], 
    ['ONE', 'have', 'you', 'heard', 'that', 'Netherfield', 'Park', 'is', 'let', 'at', 'last']
]
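
Note that the result above still contains the repeated 'ONE' inside the first sublist, whereas the expected output in the question drops it. As a minimal sketch of one way to drop it, a single pass over the tokens could look like the following. The function name unflatten and the handling of tokens that appear before the first marker are my own assumptions, not part of the answer above; the token list is a shortened sample.

```python
BREAK_POINTS = ('ONE:', 'TWO:', 'THREE:')

def unflatten(tokens, break_points=BREAK_POINTS):
    """Split tokens into sublists that each start at a break-point marker.

    A marker identical to the one that opened the current sublist is
    treated as a repeat and dropped. (Hypothetical helper.)
    """
    result = []
    current_marker = None
    for tok in tokens:
        if tok in break_points:
            if tok == current_marker:
                continue  # repeated marker: ignore it
            current_marker = tok
            result.append([tok])  # open a new sublist
        elif result:
            result[-1].append(tok)
        else:
            # Assumption: tokens before the first marker get their own sublist.
            result.append([tok])
    return result

# Shortened sample, for illustration only.
sample_tokens = ['ONE:', 'It', 'is', 'ONE:', 'a', 'truth',
                 'TWO:', 'However', 'ONE:', 'this']
print(unflatten(sample_tokens))
# [['ONE:', 'It', 'is', 'a', 'truth'], ['TWO:', 'However'], ['ONE:', 'this']]
```

This avoids building the intermediate index list, at the cost of appending token by token instead of slicing.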