Can't get looping through regex list to work

Time:05-01

import re
othello_full = open('C:/Users/.../Othello.txt', encoding="mbcs").read()
split_dialogue = othello_full.split("\n\n")
dict = {}
for i in split_dialogue:
    m = re.match(r'(BRABANTIO|GRATIANO|LODOVICO|OTHELLO|CASSIO|IAGO|MONTANO|RODERIGO|CLOWN|DESDEMONA|EMILIA|BIANCA).*(\.$|\?$|!$)', i)
    if bool(m) == True:
        dict[i.split(".", maxsplit = 1)[0]] = i.split(".", maxsplit = 1)[1]
    else:
        print('boo') #for purely diagnostic purpose

I am trying to create a dictionary and have the loop insert each character's name and their respective dialogue. I tested the regex and it works (at least for my limited samples). I tested each component individually, and they work. But they don't work inside the loop. Why? Also, is there a more elegant way than listing all the characters' names in the regex?

Source of the download: https://www.gutenberg.org/ebooks/1531

Sample of the input text

['\n*** START OF THE PROJECT GUTENBERG EBOOK OTHELLO, THE MOOR OF VENICE ***',
 'cover ',
 '',
 'OTHELLO, THE MOOR OF VENICE',
 '',
 'by William Shakespeare',
 '',
 'Contents',
 'ACT I\nScene I. Venice. A street.\nScene II. Venice. Another street.\nScene III. Venice. A council chamber.',
 '\nACT II\nScene I. A seaport in Cyprus. A Platform.\nScene II. A street.\nScene III. A Hall in the Castle.',
 '\nACT III\nScene I. Cyprus. Before the Castle.\nScene II. Cyprus. A Room in the Castle.\nScene III. Cyprus. The Garden of the Castle.\nScene IV. Cyprus. Before the Castle.',
 '\nACT IV\nScene I. Cyprus. Before the Castle.\nScene II. Cyprus. A Room in the Castle.\nScene III. Cyprus. Another Room in the Castle.',
 '\nACT V\nScene I. Cyprus. A Street.\nScene II. Cyprus. A Bedchamber in the castle.',
 '',
 'Dramatis Personæ',
 'DUKE OF VENICE\nBRABANTIO, a Senator of Venice and Desdemona’s father\nOther Senators\nGRATIANO, Brother to Brabantio\nLODOVICO, Kinsman to Brabantio\nOTHELLO, a noble Moor in the service of Venice\nCASSIO, his Lieutenant\nIAGO, his Ancient\nMONTANO, Othello’s predecessor in the government of Cyprus\nRODERIGO, a Venetian Gentleman\nCLOWN, Servant to Othello',
 'DESDEMONA, Daughter to Brabantio and Wife to Othello\nEMILIA, Wife to Iago\nBIANCA, Mistress to Cassio',
 'Officers, Gentlemen, Messenger, Musicians, Herald, Sailor, Attendants,\n&c.',
 'SCENE: The First Act in Venice; during the rest of the Play at a\nSeaport in Cyprus.',
 '\nACT I',
 'SCENE I. Venice. A street.',
 ' Enter Roderigo and Iago.',
 'RODERIGO.\nTush, never tell me, I take it much unkindly\nThat thou, Iago, who hast had my purse,\nAs if the strings were thine, shouldst know of this.',
 'IAGO.\n’Sblood, but you will not hear me.\nIf ever I did dream of such a matter,\nAbhor me.',
 'RODERIGO.\nThou told’st me, thou didst hold him in thy hate.',
 'IAGO.\nDespise me if I do not. Three great ones of the city,\nIn personal suit to make me his lieutenant,\nOff-capp’d to him; and by the faith of man,\nI know my price, I am worth no worse a place.\nBut he, as loving his own pride and purposes,\nEvades them, with a bombast circumstance,\nHorribly stuff’d with epithets of war:\nAnd in conclusion,\nNonsuits my mediators: for “Certes,â€\x9d says he,\n“I have already chose my officer.â€\x9d\nAnd what was he?\nForsooth, a great arithmetician,\nOne Michael Cassio, a Florentine,\nA fellow almost damn’d in a fair wife,\nThat never set a squadron in the field,\nNor the division of a battle knows\nMore than a spinster, unless the bookish theoric,\nWherein the toged consuls can propose\nAs masterly as he: mere prattle without practice\nIs all his soldiership. But he, sir, had the election,\nAnd I, of whom his eyes had seen the proof\nAt Rhodes, at Cyprus, and on other grounds,\nChristian and heathen, must be belee’d and calm’d\nBy debitor and creditor, this counter-caster,\nHe, in good time, must his lieutenant be,\nAnd I, God bless the mark, his Moorship’s ancient.',
 'RODERIGO.\nBy heaven, I rather would have been his hangman.',
 'IAGO.\nWhy, there’s no remedy. ’Tis the curse of service,\nPreferment goes by letter and affection,\nAnd not by old gradation, where each second\nStood heir to the first. Now sir, be judge yourself\nWhether I in any just term am affin’d\nTo love the Moor.',
 'RODERIGO.\nI would not follow him, then.',
 'IAGO.\nO, sir, content you.\nI follow him to serve my turn upon him:\nWe cannot all be masters, nor all masters\nCannot be truly follow’d. You shall mark\nMany a duteous and knee-crooking knave\nThat, doting on his own obsequious bondage,\nWears out his time, much like his master’s ass,\nFor nought but provender, and when he’s old, cashier’d.\nWhip me such honest knaves. Others there are\nWho, trimm’d in forms, and visages of duty,\nKeep yet their hearts attending on themselves,\nAnd throwing but shows of service on their lords,\nDo well thrive by them, and when they have lin’d their coats,\nDo themselves homage. These fellows have some soul,\nAnd such a one do I profess myself. For, sir,\nIt is as sure as you are Roderigo,\nWere I the Moor, I would not be Iago:\nIn following him, I follow but myself.\nHeaven is my judge, not I for love and duty,\nBut seeming so for my peculiar end.\nFor when my outward action doth demonstrate\nThe native act and figure of my heart\nIn complement extern, ’tis not long after\nBut I will wear my heart upon my sleeve\nFor daws to peck at: I am not what I am.'

CodePudding user response:

I've made a regex to parse all of the text correctly.

import re
text_re = re.compile(
    r"(?<=\n\n)"  # Always 2 newlines before name.
    # Name consists of one or more capitalized words, followed by a dot.
    r"(?P<name>(?:[A-Z]+ ?)+)\.\n"
    # Dialogue consists of
    r"(?P<dialog>(?:"
    # One or more continuous lines.
    r"(?:[^ \n].+\n)+"
    # Sometimes, actions such as "Enter" or [Exit] are included.
    r"(?:\n(?: (?:\[|Enter ).+\n\n)+"
    # But they always follow with lines who aren't names (all caps).
    r"(?=\w[^A-Z]))?)"
    # This can repeat multiple times until dialog ends.
    r"+)")

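The regex is complex, but the inline comments explain each part.

To test it, you can use the following. Note the utf-8 encoding: reading the file as mbcs is most likely what produces sequences such as ’ in the sample above.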

with open("pg1531.txt", encoding="utf-8") as txtfile:
    text = txtfile.read()

for match in text_re.finditer(text):
    print("Name:", match.group("name"))
    print("Text:", match.group("dialog"))
    print()
    input()

Press <Enter> multiple times to see the dialog continuing.

You can then use it to map dialogs to people as you see fit:

import collections
dialogs = collections.defaultdict(list)
for match in text_re.finditer(text):
    dialogs[match.group("name")].append(match.group("dialog"))

And extract the first 10 dialogs by Montano:

print(dialogs["MONTANO"][:10])
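
As for why the loop in the question prints 'boo' for every block: by default `.` does not match a newline, and without re.MULTILINE `$` only matches at the end of the string (or just before a trailing newline), so `.*` can never get past the line break that follows the speaker's name. Below is a minimal sketch of that diagnosis, together with a simpler pattern that does not list the names. `sample_blocks`, `speaker_re` and `speeches` are illustrative names, the sample blocks are abbreviated from the question, and the shortened name list is only for the demonstration.

```python
import re

# Abbreviated blocks in the same shape as the question's split("\n\n") output.
sample_blocks = [
    "SCENE I. Venice. A street.",
    " Enter Roderigo and Iago.",
    "RODERIGO.\nTush, never tell me, I take it much unkindly\n"
    "That thou, Iago, who hast had my purse,\n"
    "As if the strings were thine, shouldst know of this.",
    "IAGO.\nDespise me if I do not.",
]

# The question's pattern (name list shortened for this demo).
original = re.compile(r"(RODERIGO|IAGO).*(\.$|\?$|!$)")

# "." stops at the newline after the name, so no block matches at all.
assert not any(original.match(block) for block in sample_blocks)

# Assumed simpler pattern: an all-caps name, a period, a newline,
# then everything else (re.DOTALL lets "." cross line breaks).
speaker_re = re.compile(
    r"(?P<name>[A-Z]+(?: [A-Z]+)*)\.\n(?P<speech>.+)", re.DOTALL
)

# Collect a list per speaker; a plain dict[name] = speech assignment
# would keep only each character's last speech.
speeches = {}
for block in sample_blocks:
    m = speaker_re.match(block)
    if m:
        speeches.setdefault(m.group("name"), []).append(m.group("speech"))

print(sorted(speeches))
```

This block-by-block approach is shorter, but it relies on every speech sitting in its own blank-line-separated block, which is the assumption the larger regex avoids.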

There aren't many caveats. The regex is complex, but unlike simpler solutions it prevents unrelated text, such as act numbers or stage directions, from entering the dialog. I haven't stripped the entrances and exits that occur in the middle of dialogs, as they're important for comprehending the dialog, but you can easily strip them if necessary.
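
If you do want the stage directions removed, one possible sketch is below. It assumes, as the regex above does, that a stage direction is a line starting with a single space followed by "[" or "Enter ". `direction_re` and `strip_directions` are hypothetical names, and the sample dialog is made up for illustration rather than taken from the play.

```python
import re

# A stage direction is assumed to be a line that starts with one space
# followed by "[" or "Enter ", mirroring the main regex.
direction_re = re.compile(r"^ (?:\[|Enter ).*\n?", re.MULTILINE)


def strip_directions(dialog):
    """Remove stage-direction lines and the blank lines around them."""
    without_directions = direction_re.sub("", dialog)
    # Collapse the runs of blank lines left behind by the removal.
    return re.sub(r"\n{2,}", "\n", without_directions).strip("\n")


# Made-up dialog in the shape the main regex captures.
sample_dialog = (
    "First line of the speech.\n"
    "\n"
    " [_Exit Servant._]\n"
    "\n"
    "Second line of the speech.\n"
)

cleaned = strip_directions(sample_dialog)
print(cleaned)
```

Applied to the mapping built earlier, this could be used as `[strip_directions(d) for d in dialogs["MONTANO"]]`.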
