Tokenize paragraphs by special characters; then rejoin the tokenized segments to reach a certain length


I have this long paragraph:

paragraph = "The weakening of the papacy by the Avignon exile and the Papal Schism; the breakdown of monastic discipline and clerical celibacy; the luxury of prelates, the corruption of the Curia, the worldly activities of the popes; the morals of Alexander VI, the wars of Julius II, the careless gaiety of Leo X; the relicmongering and peddling of indulgences; the triumph of Islam over Christendom in the Crusades and the Turkish wars; the spreading acquaintance with non-Christian faiths; the influx of Arabic science and philosophy; the collapse of Scholasticism in the irrationalism of Scotus and the skepticism of Ockham; the failure of the conciliar movement to effect reform; the discovery of pagan antiquity and of America; the invention of printing; the extension of literacy and education; the translation and reading of the Bible; the newly realized contrast between the poverty and simplicity of the Apostles and the ceremonious opulence of the Church; the rising wealth and economic independence of Germany and England; the growth of a middle class resentful of ecclesiastical restrictions and claims; the protests against the flow of money to Rome; the secularization of law and government; the intensification of nationalism and the strengthening of monarchies; the nationalistic influence of vernacular languages and literatures; the fermenting legacies of the Waldenses, Wyclif, and Huss; the mystic demand for a less ritualistic, more personal and inward and direct religion: all these were now uniting in a torrent of forces that would crack the crust of medieval custom, loosen all standards and bonds, shatter Europe into nations and sects, sweep away more and more of the supports and comforts of traditional beliefs, and perhaps mark the beginning of the end for the dominance of Christianity in the mental life of European man."

My goal is to split this long paragraph into multiple sentences, keeping each sentence around 18-30 words.

There is only one full stop, at the very end, so the NLTK sentence tokenizer is of no use. I can use a regex to tokenize; this pattern works for splitting:

regex_special_chars = '([″;*"(§=!‡…†\\?\\]‘)¿♥[]+)'
new_text = re.split(regex_special_chars, paragraph)
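(For reference: because the pattern contains a capturing group, re.split keeps the matched separators as their own list items, so nothing from the original text is lost. A minimal illustration with a simplified separator pattern of my own, not the full character class above:)

```python
import re

# With a capturing group, re.split returns the delimiters as separate
# list items interleaved with the text between them.
tokens = re.split(r'(; )', 'a; b; c')
print(tokens)  # ['a', '; ', 'b', '; ', 'c']
```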

The question is how to rejoin these tokens into a list of segments of around 18 to 30 words each, where possible; sometimes it is not possible with this regex.

The end result will look like the following list below:

tokenized_paragraph = ['The weakening of the papacy by the Avignon exile and the Papal Schism; the breakdown of monastic discipline and clerical celibacy;',
 'the luxury of prelates, the corruption of the Curia, the worldly activities of the popes; the morals of Alexander VI, the wars of Julius II, the careless gaiety of Leo X;',
 'the relicmongering and peddling of indulgences; the triumph of Islam over Christendom in the Crusades and the Turkish wars; the spreading acquaintance with non-Christian faiths; ',
 'the influx of Arabic science and philosophy; the collapse of Scholasticism in the irrationalism of Scotus and the skepticism of Ockham; the failure of the conciliar movement to effect reform; ',
 'the discovery of pagan antiquity and of America; the invention of printing; the extension of literacy and education; the translation and reading of the Bible; ',
 'the newly realized contrast between the poverty and simplicity of the Apostles and the ceremonious opulence of the Church; the rising wealth and economic independence of Germany and England;',
 'the growth of a middle class resentful of ecclesiastical restrictions and claims; the protests against the flow of money to Rome; the secularization of law and government; ',
 'the intensification of nationalism and the strengthening of monarchies; the nationalistic influence of vernacular languages and literatures; the fermenting legacies of the Waldenses, Wyclif, and Huss;',
 'the mystic demand for a less ritualistic, more personal and inward and direct religion: all these were now uniting in a torrent of forces that would crack the crust of medieval custom, loosen all standards and bonds, shatter Europe into nations and sects, sweep away more and more of the supports and comforts of traditional beliefs, and perhaps mark the beginning of the end for the dominance of Christianity in the mental life of European man.']

If we check the lengths of the end result, we get this many words in each tokenized segment:

[len(sent.split()) for sent in tokenized_paragraph]
[21, 31, 25, 30, 25, 29, 27, 26, 76]

Only the last segment exceeded 30 words (76 words), and that's okay!

Edit: the character class could also include a colon (:), so the last segment would come out shorter than 76 words.

CodePudding user response:

I would suggest using re.findall instead of re.split.

Then the regex could be:

(?:\S+\s+)*?(?:\S+\s+){17,29}\S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+)

Break-down:

  • \S+\s+: a word and the whitespace that follows it
  • (?:\S+\s+)*?(?:\S+\s+){17,29}: lazily match some words (so initially it won't match any), then greedily match as many words as possible, at least 17 and up to 29, each followed by whitespace. The lazy prefix is needed for when no match completes with just the greedy part.
  • \S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+): match one more word, terminated by one or more terminator characters, or by the end of the string.
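To see how the lazy prefix, the greedy window, and backtracking interact, here is the same shape of pattern with a smaller window ({2,4}, i.e. 3 to 5 words per segment) and a single ; terminator. This toy pattern and sample text are my own, not the answer's exact regex:

```python
import re

# Same structure as the full pattern, but a 3-5 word window and only
# '; ' as the terminator, for readability.
toy = r'(?:\S+\s+)*?(?:\S+\s+){2,4}\S+(?:$|; )'
text = "one two three; four five six seven; eight nine ten eleven twelve"

segments = re.findall(toy, text)
print(segments)
# ['one two three; ', 'four five six seven; ', 'eight nine ten eleven twelve']
print([len(s.split()) for s in segments])  # [3, 4, 5]
```

Note how \S+ first swallows "three;" whole, then backtracks one character so the ; can satisfy the terminator alternative; the final segment instead matches via $.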

So:

regex = r'(?:\S+\s+)*?(?:\S+\s+){17,29}\S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+)'
new_text = re.findall(regex, paragraph)

for line in new_text:
    print(len(line.split()), line)

The number of words per segment is:

[21, 31, 25, 30, 25, 29, 27, 26, 76]
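Alternatively, the grouping can be done without findall by splitting first (as in the question) and then greedily accumulating chunks until a word budget is met. A minimal sketch; the pack_segments helper, the simplified (; ) split pattern, and the min_words budget are illustrative choices of mine, not taken from the answer:

```python
import re

def pack_segments(text, min_words=3, split_pattern=r'(; )'):
    """Split on the (captured) delimiter, then pack chunks greedily
    until each segment reaches at least min_words words."""
    parts = re.split(split_pattern, text)
    # Re-attach each captured delimiter to the chunk that precedes it.
    chunks = [''.join(parts[i:i + 2]) for i in range(0, len(parts), 2)]
    segments, current = [], ''
    for chunk in chunks:
        current += chunk
        if len(current.split()) >= min_words:
            segments.append(current)
            current = ''
    if current:
        # Fold any short remainder into the last segment.
        if segments:
            segments[-1] += current
        else:
            segments.append(current)
    return segments

segments = pack_segments("one two three; four five; six", min_words=3)
print(segments)  # ['one two three; ', 'four five; six']
```

Unlike the regex approach, this only enforces a minimum length, but it is easier to extend (e.g. with a maximum, or the colon terminator mentioned in the edit).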