I want to split the text of a novel into its chapters, and then split each chapter into chunks of 1000 words.
I already created a list with my chapters, but now I have no clue how to split those list elements automatically into another list, so that each chapter gets its own list of chunks.
I can do it for one element but I'm really stuck here. I don't think that the solution is complicated but I just don't get it. (I assume some kind of loop would work?)
text = chapters[1]
text = text.split()
n = 1000
batch = [' '.join(text[i:i + n]) for i in range(0, len(text), n)]
Also, would it be better to work with dictionaries or a data frame? Thanks in advance!
CodePudding user response:
Is this what you're looking for?
Each chapter becomes one element of batch, and each of those elements is a list of strings of up to 1000 words each. So the first thousand words of the first chapter are batch[0][0], and the fifth chunk of chapter 3 (words 4001 to 5000) is batch[2][4].
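n = 1000  # words per chunk
batch = []
for text in chapters:
    print(text)  # optional: shows the chapter being processed
    text = text.split()
    # one list of chunk strings per chapter
    batch.append([' '.join(text[i:i + n]) for i in range(0, len(text), n)])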
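On the dictionary part of the question: here is a minimal sketch, not a definitive answer, of how the same chunking could be wrapped in a function and stored under chapter numbers instead of list positions. The helper name chunk_words, the sample chapters, and the 1-based numbering are all assumptions made for illustration.

```python
def chunk_words(text, n=1000):
    """Split text on whitespace and return strings of at most n words each."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]


# Hypothetical sample data; in the question this would be the real chapter list.
chapters = ["one two three four five", "six seven eight"]

# List of lists: chunks_by_index[chapter_index][chunk_index]
chunks_by_index = [chunk_words(chapter, n=2) for chapter in chapters]

# Dictionary keyed by 1-based chapter number (an assumed convention).
chunks_by_chapter = {
    number: chunk_words(chapter, n=2)
    for number, chapter in enumerate(chapters, start=1)
}

print(chunks_by_chapter[1])  # ['one two', 'three four', 'five']
```

A plain list of lists is probably enough if you only ever access chapters by position. A dictionary mainly helps if you want keys such as chapter numbers or titles.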
CodePudding user response:
If you want a one-liner, the result would be a list of lists of lists.
You provide a list of chapters, and each chapter is converted into a list of chunks (each chunk being another list) of n words, so the code would be something like this:
chapters = [
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed maximus euismod turpis sit amet venenatis.",
"Nunc a volutpat enim, vel sollicitudin est. Maecenas semper condimentum scelerisque.",
]
n = 5
rc = [[s[i:i + n] for i in range(0, len(s), n)] for c in chapters if (s := c.split())]
The result would look like this:
# List of chapters
[
    # List of chunks
    [
        # List of words
        ['Lorem', 'ipsum', 'dolor', 'sit', 'amet,'],
        ['consectetur', 'adipiscing', 'elit.', 'Sed', 'maximus'],
        ['euismod', 'turpis', 'sit', 'amet', 'venenatis.']
    ],
    [
        ['Nunc', 'a', 'volutpat', 'enim,', 'vel'],
        ['sollicitudin', 'est.', 'Maecenas', 'semper', 'condimentum'],
        ['scelerisque.']
    ],
]
To avoid calling split() more than once per chapter, its result is bound to the variable s with the assignment expression := (available in Python 3.8 and later).
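For comparison, here is a sketch of the same logic written as an explicit loop, which may be easier to follow. Note one side effect of the condition in the one-liner: a chapter with no words produces an empty list from split(), which is falsy, so that chapter is skipped and the positions in the result no longer line up with the positions in chapters. The empty chapter in the sample data is an assumption added only to show this.

```python
chapters = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    "",  # empty chapter, added here only to show the filtering
    "Nunc a volutpat enim, vel sollicitudin est.",
]
n = 5

# One-liner with the assignment expression (requires Python 3.8+).
rc = [[s[i:i + n] for i in range(0, len(s), n)] for c in chapters if (s := c.split())]

# Equivalent explicit loop.
rc_loop = []
for c in chapters:
    s = c.split()
    if not s:  # same effect as the `if (s := c.split())` filter
        continue
    rc_loop.append([s[i:i + n] for i in range(0, len(s), n)])

print(rc == rc_loop)           # True
print(len(chapters), len(rc))  # 3 2
```

If you need the result to stay aligned with chapters, drop the `continue` branch in the loop so that empty chapters yield an empty list of chunks instead of disappearing.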