I have a text file like the following:
FAKE ET FAKE
1, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34 © [email protected]
FAKE-ET-FAKE.fr
FAKE AGAIN
2, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34 © [email protected]
FAKE-AGAIN.fr
STILL FAKE AGAIN ANOTHER
2, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34 © [email protected]
STILL-FAKE-AGAIN-ANOTHER.fr
Using a regex, I want to extract the header of each paragraph.
I know that each header consists of upper-case words separated by spaces, but the number of words (and therefore of spaces) varies from header to header.
I have tried the following, but I cannot make it work regardless of how many words the "UPPER UPPER UPPER ..." pattern contains.
Here is what I have tried:
regex = r'[A-Z]+\s[A-Z]+'
re.findall(regex, text)
Here I would only find "FAKE AGAIN" in my example.
I have also tried
regex = r'([A-Z]+\s[A-Z]+)+'
to say that this UPPER\s pattern can repeat, but it did not work.
CodePudding user response:
With x.txt containing your text, the following worked fine for me:
egrep -e '^[A-Z][A-Z ]*$' x.txt
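Since the question uses Python's re module, here is a minimal sketch of the same line-anchored idea in Python. The variable names (sample_text, line_pattern, line_headers) are only illustrative, and the sample is a shortened copy of the text from the question with the e-mail placeholder left out.

```python
import re

# Shortened copy of the sample from the question (e-mail placeholder omitted).
sample_text = """FAKE ET FAKE
1, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34
FAKE-ET-FAKE.fr
FAKE AGAIN
2, rue Fake - 99567 FAKE
FAKE-AGAIN.fr
STILL FAKE AGAIN ANOTHER
STILL-FAKE-AGAIN-ANOTHER.fr"""

# re.MULTILINE makes ^ and $ match at the start and end of every line,
# which is what egrep does implicitly by working line by line.
line_pattern = re.compile(r'^[A-Z][A-Z ]*$', re.MULTILINE)

line_headers = line_pattern.findall(sample_text)
print(line_headers)
```

This only keeps lines made up entirely of upper-case letters and spaces, so it assumes a header always sits alone on its line.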
CodePudding user response:
Use
regex = r'[A-Z]+(?:[^\S\n][A-Z]+)+'
See regex proof.
EXPLANATION
- [A-Z]+ - Match a single character in the range between A (index 65) and Z (index 90) (case sensitive) between one and unlimited times, as many times as possible, giving back as needed (greedy)
- Non-capturing group (?:[^\S\n][A-Z]+)+
- + - matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
- Match a single character not present in the list below [^\S\n] (in effect, any whitespace character except a newline)
- \S - matches any non-whitespace character (equivalent to [^\r\n\t\f\v ])
- \n - matches a line-feed (newline) character (ASCII 10)
- Match a single character present in the list below [A-Z]+
- + - matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
- A-Z - matches a single character in the range between A (index 65) and Z (index 90) (case sensitive)
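To see the pattern explained above in action, here is a small sketch that runs it through re.findall. The names sample and header_pattern are just for illustration, and the sample is a shortened copy of the question's text with the e-mail placeholder left out.

```python
import re

# Shortened copy of the sample from the question (e-mail placeholder omitted).
sample = """FAKE ET FAKE
1, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34
FAKE-ET-FAKE.fr
FAKE AGAIN
2, rue Fake - 99567 FAKE
FAKE-AGAIN.fr
STILL FAKE AGAIN ANOTHER
STILL-FAKE-AGAIN-ANOTHER.fr"""

# One upper-case word, then one or more repetitions of
# "a whitespace character that is not a newline, followed by an upper-case word".
header_pattern = re.compile(r'[A-Z]+(?:[^\S\n][A-Z]+)+')

# The group is non-capturing, so findall returns the full matches.
found = header_pattern.findall(sample)
print(found)
```

Because the group must repeat at least once, this pattern only matches headers of two or more words; a lone upper-case word such as the "FAKE" at the end of each address line is skipped, and [^\S\n] keeps a match from running across a line break.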