Extract page and page number with Python regex

Time:11-25

I want to extract the word page and the page number from a URL with a regex. The page number appears in a couple of variations:

fghghdsfs/page4
fghghdsfs/page-4
sfgsfgsfg/page=4
hteheth/page-4/
dhdghgd/page=4/
dghdghdh/page/4/
dghdghdh/page/4
fghghdsfs?page4
dhdghd?page-4
dghdg?page-4/
eyeyt?page=4
etyetyet?page=4/
nvnndgnd?page/4/
dghdghdh/page/4

The page number should have between 1 and 3 digits.

I tried this regex, but it does not handle the case where / separates page from the number:

(=|\?|\/)(page)(_|-|=|\d{1,3}|\/)

CodePudding user response:

There are two problems with the regex you have:

  1. \d{1,3} is inside the parentheses, as one of the alternatives. You're saying: page followed by either a separator or the page number, but not both. Put it after the parentheses, and make it a capture group so you can extract it later.
  2. Once the digits are moved out, the group with separators is required, so page4 no longer matches. Put a ? after the group to make it optional.
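The two problems can be checked quickly in Python. This is only a small sketch: the names ORIGINAL, DIGITS_MOVED_OUT and third_group are made up for illustration, and the second pattern is the intermediate step (digits moved out, separator still required), not something from the question.

```python
import re

# The regex from the question: the digits are one alternative inside group 3.
ORIGINAL = re.compile(r"(=|\?|\/)(page)(_|-|=|\d{1,3}|\/)")

# Hypothetical intermediate step: digits moved out, separator group still required.
DIGITS_MOVED_OUT = re.compile(r"(=|\?|\/)(page)(_|-|=|\/)(\d{1,3})")


def third_group(url):
    """Return what group 3 of the original regex captured, or None if no match."""
    m = ORIGINAL.search(url)
    return m.group(3) if m else None


# Problem 1: group 3 holds the separator OR the number, never both.
print(third_group("fghghdsfs/page4"))    # the digit
print(third_group("dghdghdh/page/4/"))   # the slash, the number is lost

# Problem 2: with a required separator group, "page4" stops matching.
print(DIGITS_MOVED_OUT.search("fghghdsfs/page4"))
```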

Fixing those:

(=|\?|\/)(page)(_|-|=|\/)?(\d{1,3})
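As a rough check, the fixed regex can be run against the URLs from the question with re.search. The names PAGE_RE, URLS and extract are only for this sketch; group 2 holds page and group 4 holds the number.

```python
import re

# The fixed regex from this answer, unchanged.
PAGE_RE = re.compile(r"(=|\?|\/)(page)(_|-|=|\/)?(\d{1,3})")

# Sample URLs taken from the question.
URLS = [
    "fghghdsfs/page4",
    "fghghdsfs/page-4",
    "sfgsfgsfg/page=4",
    "hteheth/page-4/",
    "dhdghgd/page=4/",
    "dghdghdh/page/4/",
    "dghdghdh/page/4",
    "fghghdsfs?page4",
    "dhdghd?page-4",
    "dghdg?page-4/",
    "eyeyt?page=4",
    "etyetyet?page=4/",
    "nvnndgnd?page/4/",
]


def extract(url):
    """Return (page, number) as strings, or None when the URL does not match."""
    m = PAGE_RE.search(url)
    return (m.group(2), m.group(4)) if m else None


for url in URLS:
    print(url, "->", extract(url))
```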


CodePudding user response:

You may use this regex:

[=?/]page[_=/-]?(\d{1,3})


RegEx Details:

  • [=?/]: Match = or ? or /
  • page: Match string page
  • [_=/-]?: Optionally match _ or = or / or -
  • (\d{1,3}): Match 1 to 3 digits and capture them in group 1
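A minimal sketch of how this pattern might be used in Python, assuming you want the page number as an int. The helper names extract_page and extract_page_strict are made up here. Note that \d{1,3} on its own will match the first three digits of a longer number; the strict variant adds a negative lookahead (?!\d), which is an addition of mine and not part of the regex above, to reject such URLs instead.

```python
import re
from typing import Optional

# The regex from this answer: group 1 is the page number.
PAGE_RE = re.compile(r"[=?/]page[_=/-]?(\d{1,3})")

# Assumed variant: (?!\d) refuses a match when a further digit follows.
PAGE_RE_STRICT = re.compile(r"[=?/]page[_=/-]?(\d{1,3})(?!\d)")


def extract_page(url: str) -> Optional[int]:
    """Return the page number found in url, or None if there is none."""
    m = PAGE_RE.search(url)
    return int(m.group(1)) if m else None


def extract_page_strict(url: str) -> Optional[int]:
    """Like extract_page, but numbers longer than 3 digits do not match."""
    m = PAGE_RE_STRICT.search(url)
    return int(m.group(1)) if m else None


print(extract_page("eyeyt?page=4"))
print(extract_page("dghdghdh/page/4/"))
print(extract_page("x/page1234"))         # truncated to the first 3 digits
print(extract_page_strict("x/page1234"))  # rejected
```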