I ran OCR on multiple images (photos of pages) that need to be grouped into logical units (chapters). I have individual page .txt documents and one .txt document containing the OCR'd text from all pages. I need to split the text into separate chapters and save each as a new .txt file. I can identify the first page of each chapter by a keyword that always occurs on it.
I've read in the full text of all pages and split it by page, so I have a list of strings in which each element is the text of one page, in the order the pages appear in the book. I'm trying to group the pages in this list, as in this GeeksforGeeks question, with the modification that the keyword is kept. For instance, if the keyword is hi and the test list is ['hi', 'yes', 'hi you', 'me'], the result should be [['hi', 'yes'], ['hi you', 'me']].
This is what I've tried so far, but using iter skips some indices and I don't know how to solve the issue otherwise.
# initialize lists
test_list = ['is the', 'best', 'and', 'is so', 'popular']
# Desired outcome: [['is the', 'best', 'and'], ['is so', 'popular']]
indices = [0]
firsts = [x for x in test_list if 'is' in x]
for f in firsts:
    temp = test_list.index(f)
    indices.append(temp)
print(indices)
it = iter(indices)
for i in it:
    print(i)
    new = test_list[i:next(it)]
CodePudding user response:
Try this code; it works for your example:
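# initialize lists
test_list = ['is the', 'best', 'and', 'is so', 'popular']
# Desired outcome: [['is the', 'best', 'and'], ['is so', 'popular']]
groups = []
for value in test_list:
    # a page containing the keyword starts a new group; so does the very
    # first page, which avoids an IndexError when it lacks the keyword
    if "is" in value or not groups:
        groups.append([value])
    else:
        groups[-1].append(value)
print(groups)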
Its output is:
[['is the', 'best', 'and'], ['is so', 'popular']]
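If you would rather keep the index-based idea from the question, here is one possible sketch. It collects the start positions with enumerate instead of list.index (which always returns the first match, so it breaks when two pages have identical text) and pairs each start with the next one using zip. The function name split_at_keyword is only illustrative, and the sketch assumes the same plain substring test as the code above.

```python
def split_at_keyword(pages, keyword):
    """Group pages into chapters; a page containing `keyword` starts a new chapter."""
    if not pages:
        return []
    # enumerate gives the position of every match, even for duplicate page texts
    starts = [i for i, page in enumerate(pages) if keyword in page]
    # pages before the first keyword (if any) form their own leading group
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    # pair each start with the following start; the last chapter runs to the end
    ends = starts[1:] + [len(pages)]
    return [pages[s:e] for s, e in zip(starts, ends)]


test_list = ['is the', 'best', 'and', 'is so', 'popular']
print(split_at_keyword(test_list, 'is'))
```

Note that a substring test such as 'is' in page also matches words like "this", so for real OCR text a more specific keyword or a stricter test is probably safer.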
CodePudding user response:
Use a single loop with a temporary list:
test_list = ['is the', 'best', 'and', 'is so', 'popular']
out = []
tmp = []
for item in test_list:
    if item.startswith('is ') and tmp:
        out.append(tmp)
        tmp = []
    tmp.append(item)
out.append(tmp)
print(out)
Or with a generator:
def group(items):
    tmp = []
    # iterate over the argument, not the global test_list
    for item in items:
        if item.startswith('is ') and tmp:
            yield tmp
            tmp = []
        tmp.append(item)
    yield tmp
out = list(group(test_list))
print(out)
Output:
[['is the', 'best', 'and'], ['is so', 'popular']]
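The question also asks for each chapter to be saved as a new .txt file, which the grouping alone does not do. A minimal sketch of that last step follows, assuming the grouped result is a list of lists of page strings as above. The helper name save_chapters, the chapter_01.txt naming scheme, and the blank-line separator between pages are all assumptions made for illustration, and the demo writes to a temporary directory only so that it can be run anywhere.

```python
import tempfile
from pathlib import Path


def save_chapters(chapters, out_dir):
    """Write each chapter (a list of page strings) to its own .txt file."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for number, pages in enumerate(chapters, start=1):
        # hypothetical naming scheme: chapter_01.txt, chapter_02.txt, ...
        target = out_path / f"chapter_{number:02d}.txt"
        # pages are joined with a blank line; adjust the separator as needed
        target.write_text("\n\n".join(pages), encoding="utf-8")
        written.append(target)
    return written


chapters = [['is the', 'best', 'and'], ['is so', 'popular']]
demo_dir = tempfile.mkdtemp()  # demo location only; pass your own folder instead
for path in save_chapters(chapters, demo_dir):
    print(path)
```

In practice you would pass the list produced by either loop above (out) and a folder of your choice as out_dir.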