Regular expression to extract string from urls

Time:04-15

I need to extract a string from a URL. Here are some examples:

Input: https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html – Output: bas-026-009

Input: https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html – Output: aw18-245-b86

Input: https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html – Output: ss20-028-e70

I want to extract the string that runs from the first character after "/eur_en/" up to (but not including) the third dash. Can someone help me? Thanks

CodePudding user response:

The expression you're looking for is the following:

/(?<=eur_en\/)[^-]*-[^-]*-[^-]*/

Here is how it works:

  • (?<=eur_en\/): looks behind for eur_en/ but does not include it in the match
  • [^-]*: matches any run of characters that are not a dash, so it takes everything up to the first dash (not including the dash)
  • -[^-]*: matches the first dash, then everything up to the second dash (not including that dash)
  • -[^-]*: matches the second dash, then everything up to the third dash (not including that dash).
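As a quick check, the expression can be run against the three URLs from the question. This is only a minimal sketch: the variable names are illustrative, and it assumes a JavaScript runtime with lookbehind support (for example a recent Node.js).

```javascript
// Sample URLs taken from the question.
const sampleUrls = [
  "https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html",
  "https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html",
  "https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html",
];

// The lookbehind keeps "eur_en/" out of the match, so match[0] is the code itself.
const lookbehindPattern = /(?<=eur_en\/)[^-]*-[^-]*-[^-]*/;

const codes = sampleUrls.map((url) => {
  const match = url.match(lookbehindPattern);
  return match ? match[0] : null; // null when the URL has no such segment
});

console.log(codes);
// [ "bas-026-009", "aw18-245-b86", "ss20-028-e70" ]
```

Because the prefix is matched by a lookbehind rather than a capture group, the result is read from index 0 of the match.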

CodePudding user response:

You're looking for this regexp: \/eur_en\/([^-]+-[^-]+-[^-]+)

Play & test it at regex101: https://regex101.com/r/RvGROG/1

You need something like this:

const urls = [
"https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html",
"https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html",
"https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html",
]

const rg = new RegExp(`\/eur_en\/([^-]+-[^-]+-[^-]+)`)
const strs = urls.map(url => url.match(rg)[1])

console.log(strs)
// Output:
// [
//  "bas-026-009",
//  "aw18-245-b86",
//  "ss20-028-e70"
// ]

Of course, this is a simple example. In real code, don't forget to check that .match returned an array with length greater than 1 (it returns null when nothing matches). The first element is the full matched string, and the second and any following elements are the substrings captured by each pair of parentheses.
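The check described above might look roughly like this. It is a sketch, not a definitive implementation: extractCode is a hypothetical helper name, and returning null for a non-matching URL is an assumption about what the caller wants.

```javascript
// Hypothetical helper: returns the captured code, or null if the URL does not match.
function extractCode(url) {
  const match = url.match(/\/eur_en\/([^-]+-[^-]+-[^-]+)/);
  // match is null when nothing matched; otherwise match[0] is the full match
  // and match[1] is the text captured by the parentheses.
  return match && match.length > 1 ? match[1] : null;
}

console.log(extractCode("https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html"));
// "bas-026-009"
console.log(extractCode("https://www.example.net/eur_en/about.html"));
// null (no dashes after /eur_en/, so the pattern does not match)
```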

We can improve and generalize the regex like so: \/((?:[^-\/]+-){2}[^-\/]+) This lets us drop the specific /eur_en/ anchor and control the number of dash-separated parts.
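To make the "control the number of parts" idea concrete, the pattern can be built from a parameter. This is a sketch under assumptions: buildPartsPattern is a hypothetical helper, and parts is assumed to be an integer of at least 1.

```javascript
// Hypothetical helper: builds the generalized pattern for a given number of
// dash-separated parts (the repeated group covers all parts but the last).
function buildPartsPattern(parts) {
  // "/" needs no escaping when the RegExp is built from a string.
  return new RegExp(`/((?:[^-/]+-){${parts - 1}}[^-/]+)`);
}

const url = "https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html";

const three = url.match(buildPartsPattern(3));
const two = url.match(buildPartsPattern(2));

console.log(three ? three[1] : null); // "aw18-245-b86"
console.log(two ? two[1] : null);     // "aw18-245"
```

One caveat of dropping the anchor: the pattern takes the first "/"-delimited segment that has enough dashes, so a hostname containing dashes (say my-cool-site.example.net) would be matched before the path. Keeping /eur_en/ in the pattern avoids that.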

CodePudding user response:

/(?<=\/eur_en\/)\w+-\w+-\w+/g
Tokens             Description
(?<=\/eur_en\/)    Lookbehind: if /eur_en/ is found, match whatever follows it.
\w+-\w+-\w+        Three runs of one or more word characters (\w = [A-Za-z0-9_]) separated by two literal hyphens.

Review: https://regex101.com/r/Ge0zA3/1
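With the g flag, String.prototype.match returns every match at once, which is handy when several URLs sit in one block of text. A minimal sketch, assuming lookbehind support in the runtime; the pageText variable is made up for illustration:

```javascript
// Illustrative input: the question's URLs embedded in one string.
const pageText = [
  "https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html",
  "https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html",
  "https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html",
].join("\n");

// With /g, match returns an array of all full matches, or null if there are none.
const allCodes = pageText.match(/(?<=\/eur_en\/)\w+-\w+-\w+/g) || [];

console.log(allCodes);
// [ "bas-026-009", "aw18-245-b86", "ss20-028-e70" ]
```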
