I have many strings that display page numbers in one of two possible formats: (pp. 4500-4503) or just 4500-4503. There may also be cases with only one page, so (pp. 113) or just 11.
Some examples of strings:
- Mitchell, J.A. (2017). Citation: Why is it so important. Mendeley Journal, 67(2), (pp. 81-95).
- Denhart, H. (2008). Deconstructing barriers: Perceptions of students labeled with learning disabilities in higher education. Journal of Learning Disabilities, 41, 483-497.
I'm using this regex for the first format:
r"pp\. \d+-\d+"
And this for the second one:
r"\d+-\d+"
Neither of them is working. I was also wondering: is there a way to use a single regex instead of creating two? Thank you.
CodePudding user response:
This pattern matches all your different formats:
(\(pp\.)? \d+(-\d+)?\)?
https://regex101.com/r/HV7rlJ/2
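A minimal sketch of applying this pattern in Python, assuming the two sample strings from the question (the variable names are illustrative). Because the pattern contains capture groups, `re.findall` would return tuples of groups rather than the matched text, so the sketch uses `re.finditer` and reads the full match. Note that the pattern only requires a space before the digits, so it also picks up volume numbers such as 67 and 41.

```python
import re

# Pattern from the answer above; it requires a space before the digits.
pattern = re.compile(r"(\(pp\.)? \d+(-\d+)?\)?")

citations = [
    "Mitchell, J.A. (2017). Citation: Why is it so important. "
    "Mendeley Journal, 67(2), (pp. 81-95).",
    "Denhart, H. (2008). Deconstructing barriers: Perceptions of students "
    "labeled with learning disabilities in higher education. "
    "Journal of Learning Disabilities, 41, 483-497.",
]

# group(0) is the whole match; strip() drops the leading space it includes.
matches = [[m.group(0).strip() for m in pattern.finditer(c)] for c in citations]

for found in matches:
    print(found)
# ['67', '(pp. 81-95)']
# ['41', '483-497']
```

If only the page numbers are wanted, the extra volume matches would have to be filtered out afterwards or avoided with a stricter pattern.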
CodePudding user response:
You might use:
\(pp\.\s+\d+(?:-\d+)?\)|\b\d+(?:-\d+)?(?=(?:\s*,\s*\d+(?:-\d+)?)*\.)
Explanation
- `\(pp\.\s+\d+(?:-\d+)?\)` Match `(pp.`, 1+ whitespace chars, 1+ digits, optionally `-` and 1+ digits, then a closing `)`
- `|` Or
- `\b` A word boundary
- `\d+(?:-\d+)?` Match 1+ digits and optionally `-` and 1+ digits
- `(?=` Positive lookahead, assert what is to the right is
  - `(?:` Non capture group to repeat as a whole part
  - `\s*,\s*` Match a comma between optional whitespace chars
  - `\d+(?:-\d+)?` Match 1+ digits and optionally `-` and 1+ digits
  - `)*` Close the non capture group and optionally repeat it
  - `\.` Match a dot
- `)` Close lookahead
See a regex demo and a Python demo.
Example
import re
pattern = r"\(pp\.\s+\d+(?:-\d+)?\)|\b\d+(?:-\d+)?(?=(?:\s*,\s*\d+(?:-\d+)?)*\.)"
s = ("- Mitchell, J.A. (2017). Citation: Why is it so important. Mendeley Journal, 67(2), (pp. 81-95). \n\n"
"- Denhart, H. (2008). Deconstructing barriers: Perceptions of students labeled with learning disabilities in higher education. Journal of Learning Disabilities, 41, 483-497.")
print(re.findall(pattern, s))
Output
['(pp. 81-95)', '41', '483-497']
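The output above still includes the volume number 41. If only the pages are needed, one possible alternative is to anchor a single pattern at the end of the citation and capture the digits in a group. This is only a sketch under the assumption that the page number or range is always the last item in the string; `PAGES_AT_END` and `extract_pages` are hypothetical names, not part of either answer.

```python
import re

# Hypothetical pattern: optional "(", optional "pp.", the page or page range
# (captured), optional ")", optional final dot, then the end of the string.
PAGES_AT_END = re.compile(r"\(?(?:pp\.\s*)?(\d+(?:-\d+)?)\)?\.?\s*$")


def extract_pages(citation):
    """Return the trailing page or page range as a string, or None."""
    m = PAGES_AT_END.search(citation)
    return m.group(1) if m else None


print(extract_pages("Mendeley Journal, 67(2), (pp. 81-95)."))          # 81-95
print(extract_pages("Journal of Learning Disabilities, 41, 483-497."))  # 483-497
print(extract_pages("Some Journal, 12, (pp. 113)"))                     # 113
```

The trade-off is that any citation ending in another number, for example a year, would be returned as if it were a page, so this only fits data where the assumption holds.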