How do I extract with regex all the text (numbers, letters, symbols) after the second capital letter

They won.             Elles gagnèrent.
They won.    Ils ont gagné.
They won.        Elles ont gagné.
Tom came.    Tom est venu.
Tom died.       Tom est mort.
Tom knew. Tom savait.
Tom left.    Tom est parti.
Tom left.       Tom partit.
Tom lied. Tom a menti.
Tom lies.    Tom ment.
Tom lost.            Tom a perdu.
Tom paid.    Tom a payé.

I'm having some trouble putting together a regex pattern that extracts all the text after the second capital letter (including it).

For example:

They won.             Elles gagnèrent.

In this case it should extract:

Elles gagnèrent.

This is my code, but it is not working well:

import re

line = "They won.             Elles gagnèrent." #for example this case

match = re.search(r"\s¿?(?:A|Á|B|C|D|E|É|F|G|H|I|Í|J|K|LL|L|M|N|Ñ|O|Ó|P|Q|R|S|T|U|Ú|V|W|X|Y|Z)\s((?:\w+\s)+)?" , line)

n_sense = match.group()

print(repr(n_sense)) #should print "Elles gagnèrent."

CodePudding user response：

You can try the following code.

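import re

file = "sentences.txt"  # placeholder path; point this at your own file
with open(file, "r", encoding="utf-8") as r:
    for line in r:
        # Drop everything before the second ASCII capital letter.
        line = re.sub(r'^[^A-Z]*[A-Z][^A-Z]*', '', line)
        print(line, end="")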
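
To try the substitution without a file, here is a minimal sketch on an in-memory list. The helper name strip_first_sentence and the sample list are assumptions made for illustration; the pattern is the one from the answer above. It also shows a limitation worth knowing: [A-Z] covers ASCII capitals only, so a second sentence that starts with a precomposed accented capital such as É is not detected and the whole line is removed.

```python
import re

# Pattern from the answer above: skip to the first ASCII capital, consume it,
# then consume every following character that is not an ASCII capital.
PREFIX = re.compile(r'^[^A-Z]*[A-Z][^A-Z]*')

def strip_first_sentence(line):  # hypothetical helper name
    """Return the line with everything before the second ASCII capital removed."""
    return PREFIX.sub('', line)

lines = [
    "They won.             Elles gagn\u00e8rent.",
    "Tom knew. Tom savait.",
    "Tom lost.            \u00c9tienne a perdu.",  # precomposed É, outside A-Z
]

results = [strip_first_sentence(line) for line in lines]
for original, result in zip(lines, results):
    print(repr(original), '->', repr(result))
```

The third line comes back empty because no second ASCII capital exists in it, so accented input probably needs the normalization step discussed further down the thread.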

CodePudding user response：

Here is the regex: .*?[A-Z].*?([A-Z][^\n]+)

The parentheses wrap the text you want; this is called a capturing group, and in Python you read it with match.group(1).

Better still, use a tool such as https://regex101.com/ to build and test the pattern.
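
How a group is read back in Python can be sketched as follows. This is only an illustration, assuming a pattern that places the second capital inside the parentheses so that the capital itself is part of the captured text; the sample line is the one from the question.

```python
import re

line = "They won.             Elles gagn\u00e8rent."

# First capital, lazily skip ahead, then capture from the second capital on.
match = re.search(r'[A-Z].*?([A-Z].*)', line)

if match is None:
    # re.search returns None when the line has fewer than two capitals.
    whole = second = None
else:
    whole = match.group(0)   # the entire matched text
    second = match.group(1)  # only the parenthesised part

print(repr(second))
```

Checking for None before calling group() avoids an AttributeError on lines that contain fewer than two capital letters.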

CodePudding user response：

You can search for the match as you describe it:

[A-Z].*?([A-Z].*)

That's an uppercase letter, followed by as few characters as possible, followed by another uppercase letter and the rest of the line, with the group capturing from the second uppercase letter onward:

import unicodedata
import re

s = '''They won.             Elles gagnèrent.
They won.    Ils ont gagné.
They won.        Elles ont gagné.
Tom came.    Tom est venu.
Tom died.       Tom est mort.
Tom knew. Tom savait.
Tom left.    Tom est parti.
Tom left.       Tom partit.
Tom lied. Tom a menti.
Tom lies.    Tom ment.
Âom lost.            Étienne a perdu.
Tom paid.    Tom a payé.'''


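# NFD splits accented capitals (Â, É) into a base letter plus a combining mark,
# which is what lets [A-Z] match them; the results stay in NFD form.
s = unicodedata.normalize('NFD', s)
re.findall(r'[A-Z].*?([A-Z].*)', s, re.UNICODE)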

Which will give you:

['Elles gagnèrent.',
 'Ils ont gagné.',
 'Elles ont gagné.',
 'Tom est venu.',
 'Tom est mort.',
 'Tom savait.',
 'Tom est parti.',
 'Tom partit.',
 'Tom a menti.',
 'Tom ment.',
 'Étienne a perdu.',
 'Tom a payé.']

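If all those spaces are part of the actual text, it may be easier to match on them or split the line. Note that [A-Z] only matches accented capitals such as the É in Étienne because NFD normalization decomposes them into a base letter plus a combining accent. The re.UNICODE flag is already the default for str patterns in Python 3 and does not change what [A-Z] matches, so the normalization step is the part that matters. The returned strings are in NFD form as well.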
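
The effect of normalization on [A-Z] can be checked in isolation. The sketch below is an illustration rather than part of the answer's code: it assumes the input uses a precomposed É (written as an escape so the form is unambiguous) and shows one possible way to recompose the text with NFC afterwards.

```python
import re
import unicodedata

word = "\u00c9tienne"  # 'Étienne' with a precomposed É (NFC form)

# The precomposed É is a single code point outside A-Z, so nothing matches.
before = re.search(r'[A-Z]', word)

# NFD turns É into 'E' + U+0301 (combining acute), so [A-Z] finds the 'E'.
decomposed = unicodedata.normalize('NFD', word)
after = re.search(r'[A-Z]', decomposed)

# Recompose so the text compares equal to ordinary NFC strings again.
recomposed = unicodedata.normalize('NFC', decomposed)

print(before)                       # None
print(after.group())                # E
print(len(word), len(decomposed))   # 7 8
print(recomposed == word)           # True
```

Recomposing with NFC is optional, but it helps if the extracted sentences are later compared with or written alongside text that was never decomposed.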