How to find pattern upper case sentence

Time:04-27

I have a text file like the following:

FAKE ET FAKE
1, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34 © [email protected]
FAKE-ET-FAKE.fr

FAKE AGAIN
2, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34 © [email protected]
FAKE-AGAIN.fr 

STILL FAKE AGAIN ANOTHER
2, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34 © [email protected]
STILL-FAKE-AGAIN-ANOTHER.fr 

With a regex, I want to extract the header of each paragraph. I know that the header consists of upper-case words separated by spaces, but the number of words (and therefore of spaces) varies.

I have tried the following, but I cannot make it work for an arbitrary number of "UPPER UPPER UPPER ..." repetitions.

Here is what I have tried:

regex = r'[A-Z]+\s[A-Z]+'
re.findall(regex, text)

Here I would only find "FAKE AGAIN" in my example.

I have also tried

regex = r'([A-Z]+\s[A-Z]+)+'

to say that this UPPER\s pattern can repeat, but it did not work.

CodePudding user response:

With x.txt containing your text, the following worked fine for me:

egrep -e '^[A-Z][A-Z ]*$' x.txt
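
Since the question uses Python's re module, here is a minimal sketch of the same line-based idea in Python. The variable names (text, line_pattern, headers) are illustrative, and the sample text is typed in by hand from the question, assuming header lines carry no trailing whitespace.

```python
import re

# Sample text modeled on the question (the e-mail placeholder is kept as-is).
text = """FAKE ET FAKE
1, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34 © [email protected]
FAKE-ET-FAKE.fr

FAKE AGAIN
2, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34 © [email protected]
FAKE-AGAIN.fr

STILL FAKE AGAIN ANOTHER
2, rue Fake - 99567 FAKE
Tél, :00 99 89 22 34 © [email protected]
STILL-FAKE-AGAIN-ANOTHER.fr
"""

# re.MULTILINE makes ^ and $ match at the start and end of every line,
# which mirrors how egrep tests each line on its own.
line_pattern = re.compile(r'^[A-Z][A-Z ]*$', re.MULTILINE)

headers = line_pattern.findall(text)
print(headers)
# ['FAKE ET FAKE', 'FAKE AGAIN', 'STILL FAKE AGAIN ANOTHER']
```

Because the whole line must consist of upper-case letters and spaces, a line with trailing whitespace other than spaces, or with accented capitals, would be skipped by this pattern.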

CodePudding user response:

Use

regex = r'[A-Z]+(?:[^\S\n][A-Z]+)+'

See regex proof.

EXPLANATION

 - [A-Z]+ - Match a single character in the range between A (index 65) and Z (index 90) (case sensitive), between one and unlimited times, as many times as possible, giving back as needed (greedy)
 - Non-capturing group (?:[^\S\n][A-Z]+)+
   - + - matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
   - Match a single character not present in the list below [^\S\n]
      - \S - matches any non-whitespace character (equivalent to [^\r\n\t\f\v ])
      - \n - matches a line-feed (newline) character (ASCII 10)
   - Match a single character present in the list below [A-Z]+
      - + - matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
      - A-Z - matches a single character in the range between A (index 65) and Z (index 90) (case sensitive)
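
To see what the [^\S\n] class buys over a plain \s, the two patterns can be compared side by side. This is only a sketch on a shortened, hand-typed version of the sample text; the names text, loose and strict are illustrative.

```python
import re

# Shortened sample modeled on the question's text.
text = (
    "FAKE ET FAKE\n"
    "1, rue Fake - 99567 FAKE\n"
    "Tél, :00 99 89 22 34\n"
    "FAKE-ET-FAKE.fr\n"
    "\n"
    "STILL FAKE AGAIN ANOTHER\n"
    "2, rue Fake - 99567 FAKE\n"
    "STILL-FAKE-AGAIN-ANOTHER.fr\n"
)

# \s also matches "\n", so a match can run across a line break,
# and without repetition it stops after two words.
loose = re.findall(r'[A-Z]+\s[A-Z]+', text)

# [^\S\n] means "whitespace other than a newline", so every match
# stays on one line; the trailing + lets the word group repeat.
strict = re.findall(r'[A-Z]+(?:[^\S\n][A-Z]+)+', text)

print(loose)   # includes matches that span a newline, e.g. 'FAKE\nT'
print(strict)  # ['FAKE ET FAKE', 'STILL FAKE AGAIN ANOTHER']
```

Note that this pattern requires at least two upper-case words and is not anchored to the start of a line, so a single-word header would be missed and a run such as "FAKE CITY" inside an address line would also be returned. Anchoring with ^ and $ under re.MULTILINE is one way to tighten it if that matters for the real data.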