Use regex to find page numbers in two different formats

Time:04-21

I have many strings that display page numbers in one of two possible formats: (pp. 4500-4503) or just 4500-4503. There may also be cases with only a single page, so (pp. 113) or just 113.

Some examples of strings:

- Mitchell, J.A. (2017). Citation: Why is it so important. Mendeley Journal, 67(2), (pp. 81-95). 

- Denhart, H. (2008). Deconstructing barriers: Perceptions of students labeled with learning disabilities in higher education. Journal of Learning Disabilities, 41, 483-497.

I'm using this regex for the first format:

r"pp\. \d+-\d+"

And this for the second one:

r"\d+-\d+"

Neither of them is working. I was also wondering: is there a way to use a single regex instead of two? Thank you.

CodePudding user response:

This pattern matches all your different formats:

(\(pp\.)? \d+(-\d+)?\)?

https://regex101.com/r/HV7rlJ/2
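A minimal sketch of how this single-pattern approach might be applied in Python, using the two sample strings from the question. Because the pattern contains capturing groups, `re.findall` would return tuples of groups rather than whole matches, so the sketch iterates with `re.finditer` and reads `group(0)`. The helper name `find_candidates` is hypothetical. Note that the pattern requires a leading space and also picks up other standalone numbers, such as the volume number.

```python
import re

# Pattern from the answer above. Both groups are capturing, so read group(0)
# from finditer instead of relying on findall.
pattern = re.compile(r"(\(pp\.)? \d+(-\d+)?\)?")

citations = [
    "- Mitchell, J.A. (2017). Citation: Why is it so important. "
    "Mendeley Journal, 67(2), (pp. 81-95). ",
    "- Denhart, H. (2008). Deconstructing barriers: Perceptions of students "
    "labeled with learning disabilities in higher education. "
    "Journal of Learning Disabilities, 41, 483-497.",
]

def find_candidates(text):
    """Return every full match, without the leading space the pattern consumes."""
    return [m.group(0).strip() for m in pattern.finditer(text)]

for citation in citations:
    print(find_candidates(citation))
# ['67', '(pp. 81-95)']
# ['41', '483-497']
```

The volume numbers 67 and 41 appear in the results, so some post-filtering (for example, keeping only the last match per citation) would probably be needed.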

CodePudding user response:

You might use:

\(pp\.\s+\d+(?:-\d+)?\)|\b\d+(?:-\d+)?(?=(?:\s*,\s*\d+(?:-\d+)?)*\.)

Explanation

  • \(pp\.\s+\d+(?:-\d+)?\) Match (pp. followed by 1+ whitespace chars, 1+ digits, optionally - and 1+ digits, and a closing )
  • | Or
  • \b A word boundary
  • \d+(?:-\d+)? Match 1+ digits and optionally - and 1+ digits
  • (?= Positive lookahead, assert that what is to the right is
    • (?: Non-capturing group to repeat as a whole part
      • \s*,\s* Match a comma between optional whitespace chars
      • \d+(?:-\d+)? Match 1+ digits and optionally - and 1+ digits
    • )* Close the non-capturing group and optionally repeat it
    • \. Match a dot
  • ) Close the lookahead

See a regex demo and a Python demo.

Example

import re

pattern = r"\(pp\.\s+\d+(?:-\d+)?\)|\b\d+(?:-\d+)?(?=(?:\s*,\s*\d+(?:-\d+)?)*\.)"

s = ("- Mitchell, J.A. (2017). Citation: Why is it so important. Mendeley Journal, 67(2), (pp. 81-95). \n\n"
            "- Denhart, H. (2008). Deconstructing barriers: Perceptions of students labeled with learning disabilities in higher education. Journal of Learning Disabilities, 41, 483-497.")

print(re.findall(pattern, s))

Output

['(pp. 81-95)', '41', '483-497']
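The output above still includes the volume number 41. If only the page reference is wanted, one possible variation (not taken from either answer) is to anchor the match at the end of the citation and capture the first and last page as named groups. This is a sketch under the assumption that the page reference is always the last item in the string; the names `PAGES` and `extract_pages` are hypothetical.

```python
import re

# Hypothetical alternative: anchor at the end of the citation so that only the
# trailing page reference matches, with or without the "(pp. ...)" wrapper.
PAGES = re.compile(
    r"\(?(?:pp\.\s+)?(?P<first>\d+)(?:-(?P<last>\d+))?\)?\.?\s*$"
)

def extract_pages(citation):
    """Return (first_page, last_page), or None if no trailing page reference is found."""
    m = PAGES.search(citation)
    if m is None:
        return None
    first = int(m.group("first"))
    # A single page such as "(pp. 113)" is reported as a one-page range.
    last = int(m.group("last")) if m.group("last") else first
    return first, last

print(extract_pages("Mendeley Journal, 67(2), (pp. 81-95). "))             # (81, 95)
print(extract_pages("Journal of Learning Disabilities, 41, 483-497."))     # (483, 497)
```

Returning integers rather than the raw substring makes both formats look the same to the caller, at the cost of discarding the original formatting.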