I'm new to Python and am using it to extract the lines spoken by certain characters in Shakespeare's plays. I'm working with a .txt file of Romeo and Juliet, which is formatted as follows:
Jul. Wilt thou be gone? It is not yet near day. It was the nightingale, and not the lark, That pierc'd the fearful hollow of thine ear. Nightly she sings on yond pomegranate tree. Believe me, love, it was the nightingale.
Rom. It was the lark, the herald of the morn; No nightingale. Look, love, what envious streaks Do lace the severing clouds in yonder East. Night's candles are burnt out, and jocund day Stands tiptoe on the misty mountain tops. I must be gone and live, or stay and die.
Jul. Yond light is not daylight; I know it, I. It is some meteor that the sun exhales To be to thee this night a torchbearer And light thee on the way to Mantua. Therefore stay yet; thou need'st not to be gone.
Rom. Let me be ta'en, let me be put to death. I am content, so thou wilt have it so. I'll say yon grey is not the morning's eye, 'Tis but the pale reflex of Cynthia's brow; Nor that is not the lark whose notes do beat The vaulty heaven so high above our heads. I have more care to stay than will to go. Come, death, and welcome! Juliet wills it so. How is't, my soul? Let's talk; it is not day.
Jul. It is, it is! Hie hence, be gone, away! It is the lark that sings so out of tune, Straining harsh discords and unpleasing sharps. Some say the lark makes sweet division; This doth not so, for she divideth us. Some say the lark and loathed toad chang'd eyes; O, now I would they had chang'd voices too, Since arm from arm that voice doth us affray, Hunting thee hence with hunt's-up to the day! O, now be gone! More light and light it grows.
Rom. More light and light- more dark and dark our woes!
The assumption I've made is that a line is directed towards the character that spoke directly before. For example, I assume that the last line of this text (' More light and light- more dark and dark our woes!') is directed towards Juliet (or Jul.).
I'm trying to extract all the lines spoken by Romeo that are directed towards Juliet, using a regular expression. This is the code I have so far:
import re
from nltk.tokenize import sent_tokenize

def get_sentences(full_text):
    sentences = sent_tokenize(full_text.strip())
    return sentences

# full_text holds the contents of the .txt file
sentences = get_sentences(full_text)
lines = []
for lines in sentences:
    if re.findall(r"\ARom.", lines):
        print(lines)
However, this only returns a list as follows:
Rom. Rom. Rom. Rom. etc.
I've been trying to figure out what to do for hours, but I can't figure out what my next step should be.
Any help is greatly appreciated!
CodePudding user response:
It looks like the pattern is that the first 'sentence' in lines is the character's name. So maybe you can split lines on the first period and take the first part as the name.
You could do that by using split() like:
character = lines.split('.')[0]
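To also cover the "directed towards Juliet" part, the same split can be combined with remembering who spoke last. This is only a minimal sketch: it assumes each speech sits on a single line that begins with the abbreviated name and a period, as in the excerpt in the question, and the function name romeo_lines_to_juliet is made up for illustration.

```python
def romeo_lines_to_juliet(text, speaker="Rom", listener="Jul"):
    """Collect speeches by `speaker` that directly follow a speech by `listener`.

    Assumption: every speech is one line of the form "Name. speech text".
    """
    speeches = []
    previous = None  # name of whoever spoke on the previous line
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # partition() splits on the first period only, so periods inside
        # the speech itself are left alone.
        name, sep, speech = line.partition(".")
        if not sep:
            continue  # no period at all, so no speaker tag to read
        if name == speaker and previous == listener:
            speeches.append(speech.strip())
        previous = name
    return speeches


sample = (
    "Jul. O, now be gone! More light and light it grows.\n"
    "Rom. More light and light- more dark and dark our woes!\n"
)
print(romeo_lines_to_juliet(sample))
```

If a speech wraps over several lines in your file, this will misread the continuation lines as speaker tags, so you would need to join them first.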
CodePudding user response:
You could read the whole file at once and, with multiline mode enabled via re.M, use a pattern like:
^Rom\. .*(?:\n(?!(?:Rom|Jul)\. ).*)*
Explanation
^
Start of a line (with re.M enabled)
Rom\.
Match Rom.
.*
Match the rest of the line
(?:
Non-capture group
  \n
  Match a newline
  (?!(?:Rom|Jul)\. ).*
  Only match the whole line if it does not start with Rom. or Jul.
)*
Optionally repeat the non-capture group to match all following lines
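As a rough illustration of how the pattern behaves in Python, here is a small sketch. The sample string reuses text from the question, but the way it is wrapped over several lines is an assumption made only to show the continuation-line handling.

```python
import re

# Same pattern as above; re.M lets ^ match at the start of every line.
PATTERN = re.compile(r"^Rom\. .*(?:\n(?!(?:Rom|Jul)\. ).*)*", re.M)

# Hypothetical line wrapping of text taken from the question.
sample = (
    "Jul. O, now be gone!\n"
    "More light and light it grows.\n"
    "Rom. More light and light-\n"
    "more dark and dark our woes!\n"
    "Jul. It is, it is!\n"
)

# The pattern has no capture groups, so findall returns the full matches.
romeo_speeches = PATTERN.findall(sample)
for speech in romeo_speeches:
    print(speech)
    print("---")
```

Note that this collects every speech by Romeo, regardless of who spoke before him; any other character names in the play would also have to be added to the (?:Rom|Jul) alternation so their speeches are not swallowed as continuation lines.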
See a regex demo and a Python demo.