How would I convert this PCRE regex to the ECMAScript (JS) regex for parsing street number and address

Time:02-12

I have been looking for the best regex that would parse the street number and name from an address. I found one (https://regex101.com/r/lU7gY7/1), but it is in PCRE instead of JavaScript. I have been playing around with it for quite some time now, but I can't seem to get the same (or any) output with JavaScript. There are comments in the link above explaining the code a little more, but below is the version with no comments.

\A\s*(?:(?:(?P<A_Addition_to_address_1>.*?),\s*)?(?:No\.\s*)?(?P<A_House_number_1>\pN+[a-zA-Z]?(?:\s*[-\/\pP]\s*\pN+[a-zA-Z]?)*)\s*,?\s*(?P<A_Street_name_1>(?:[a-zA-Z]\s*|\pN\pL{2,}\s\pL)\S[^,#]*?(?<!\s))\s*(?:(?:[,\/]|(?=\#))\s*(?!\s*No\.)(?P<A_Addition_to_address_2>(?!\s).*?))?
|
(?:(?P<B_Addition_to_address_1>.*?),\s*(?=.*[,\/]))?
(?!\s*No\.)(?P<B_Street_name>\S\s*\S(?:[^,#](?!\b\pN+\s))*?(?<!\s))\s*[\/,]?\s*(?:\sNo\.)?\s+(?P<B_House_number>\pN+\s*-?[a-zA-Z]?(?:\s*[-\/\pP]?\s*\pN+(?:\s*[\-a-zA-Z])?)*|[IVXLCDM]+(?!.*\b\pN+\b))(?<!\s)\s*(?:(?:[,\/]|(?=\#)|\s)\s*(?!\s*No\.)\s*(?P<B_Addition_to_address_2>(?!\s).*?))?)\s*\Z

These addresses are the set of addresses that I'm trying to parse:

100 Baker Street
109 - 111 Wharfside Street
40-42 Parkway
25b-26 Sun Street
43a Garden Walk
6/7 Marine Road
10 - 12 Acacia Ave
4513 3RD STREET CIRCLE WEST
0 1/2 Fifth Avenue
194-03 1/2 50th Avenue

I have tried to enter them into the regex sites and switched to JavaScript instead of PCRE and have corrected the issues that the sites highlight, but that doesn't seem to work. I have also tried the code snippets to convert PCRE to JS, but that hasn't worked either. I'm thinking there are some fundamental differences that I am missing. Could someone help me out here with the conversion?

Update: The goal is to have this as the end result:

{A_House_number_1: "109 - 111", A_Street_name_1: "Wharfside Street"}
{A_House_number_1: "40-42", A_Street_name_1: "Parkway"}
{A_House_number_1: "25b-26", A_Street_name_1: "Sun Street"}
{A_House_number_1: "43a", A_Street_name_1: "Garden Walk"}
{A_House_number_1: "6/7", A_Street_name_1: "Marine Road"}
{A_House_number_1: "10 - 12", A_Street_name_1: "Acacia Ave"}
{A_House_number_1: "4513", A_Street_name_1: "3RD STREET CIRCLE WEST"}
{A_House_number_1: "0 1/2", A_Street_name_1: "Fifth Avenue"}
{A_House_number_1: "194-03 1/2", A_Street_name_1: "50th Avenue"}

Answer:

Your pattern uses Unicode property classes in a syntax that ECMAScript 2018 does not accept, and the two engines spell named capturing groups differently. Moreover, Unicode property classes only work in an ECMAScript regex when the /u flag is set, and that flag is stricter about escaping, so you must escape only the characters that actually need it. Finally, the \A and \Z / \z anchors are not supported in the ECMAScript regex flavor; use ^ and $ instead (without the m flag they match only at the start and end of the string).
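To see the stricter parsing for yourself, here is a minimal sketch that tries to compile a few fragments with and without the /u flag. The compiles helper and the sample fragments are made up for illustration; only the RegExp constructor is a built-in.

```javascript
// Hypothetical helper: report whether a pattern compiles with the given flags.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    return false; // the engine threw (a SyntaxError) while parsing the pattern
  }
}

const checks = {
  escapedHashNoU: compiles('\\#', ''),        // tolerated without /u
  escapedHashU: compiles('\\#', 'u'),         // rejected: # is not a syntax character
  bareHashU: compiles('#', 'u'),              // accepted
  pcreProperty: compiles('\\pN+', 'u'),       // rejected: braces are required
  esProperty: compiles('\\p{N}+', 'u'),       // accepted
  pcreNamedGroup: compiles('(?P<n>x)', 'u'),  // rejected: (?P<...> is PCRE/Python syntax
  esNamedGroup: compiles('(?<n>x)', 'u'),     // accepted
  pcreAnchor: compiles('\\Ax', 'u'),          // rejected: \A is not an ECMAScript escape
};
console.log(checks);
```

Running a suspicious fragment through a check like this is a quick way to find which part of a ported pattern the engine is rejecting.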

Here are examples of what is changed:

  • Removed the comments as the COMMENT / FREESPACING mode (usually enabled with /x or (?x) flags/options) is not supported in ECMAScript regex
  • (?P<A_Addition_to_address_1>.*?) => (?<A_Addition_to_address_1>.*?) (the P after ? is not supported)
  • \pN => \p{N} (the Unicode category name/alias must appear inside curly braces)
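  • \# => # (under the /u flag, escaping a character that is not a regex syntax character is a syntax error) and [\-a-zA-Z] => [-a-zA-Z] (optional cleanup: \- is still accepted inside a character class under /u, but a leading hyphen needs no escape)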
  • \A => ^ (unsupported anchor)
  • \Z => $ (unsupported anchor)
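The rewrites listed above are mechanical enough to sketch as a handful of string replacements. The pcreToEcmaScript function below is a hypothetical helper, not part of any library: it assumes the comments have already been removed, handles only the constructs in this list, and can misfire on edge cases such as an escaped backslash followed by A or p. The short pattern used to try it out is a toy, not the full address regex.

```javascript
// Hypothetical helper: applies only the rewrites listed above to a
// comment-free PCRE pattern string. It is not a general PCRE parser.
function pcreToEcmaScript(pcreSource) {
  return pcreSource
    .replace(/\(\?P</g, '(?<')                   // (?P<name>...) -> (?<name>...)
    .replace(/\\([pP])([A-Za-z])/g, '\\$1{$2}')  // \pN -> \p{N}
    .replace(/\\#/g, '#')                        // drop the escape that /u rejects
    .replace(/\\A/g, '^')                        // start-of-string anchor
    .replace(/\\[Zz]/g, '$$');                   // end anchor ("$$" inserts one "$")
}

// Toy PCRE pattern (an assumption for illustration only).
const pcre = '\\A(?P<num>\\pN+)\\s+(?P<street>\\pL[^,\\#]*)\\Z';
const converted = pcreToEcmaScript(pcre);
console.log(converted);

// The converted source needs the u flag because of \p{...}.
const re = new RegExp(converted, 'u');
const m = re.exec('100 Baker Street');
console.log(m.groups.num, m.groups.street);
```

A real conversion still deserves a manual review afterwards, since constructs outside this list (possessive quantifiers, atomic groups, inline comments) have no direct ECMAScript equivalent.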

You can use

const addresses = ['109 - 111 Wharfside Street', '40-42 Parkway', '25b-26 Sun Street', '43a Garden Walk', '6/7 Marine Road', '10 - 12 Acacia Ave', '4513 3RD STREET CIRCLE WEST', '0 1/2 Fifth Avenue', '194-03 1/2 50th Avenue'];
// The /u flag is required for the \p{...} Unicode property classes.
const regex = /^\s*(?:(?:(?<A_Addition_to_address_1>.*?),\s*)?(?:No\.\s*)?(?<A_House_number_1>\p{N}+[a-zA-Z]?(?:\s*[-\/\p{P}]\s*\p{N}+[a-zA-Z]?)*)\s*,?\s*(?<A_Street_name_1>(?:[a-zA-Z]\s*|\p{N}\p{L}{2,}\s\p{L})\S[^,#]*?(?<!\s))\s*(?:(?:[,\/]|(?=#))\s*(?!\s*No\.)(?<A_Addition_to_address_2>(?!\s).*?))?|(?:(?<B_Addition_to_address_1>.*?),\s*(?=.*[,\/]))?(?!\s*No\.)(?<B_Street_name>\S\s*\S(?:[^,#](?!\b\p{N}+\s))*?(?<!\s))\s*[\/,]?\s*(?:\sNo\.)?\s+(?<B_House_number>\p{N}+\s*-?[a-zA-Z]?(?:\s*[-\/\p{P}]?\s*\p{N}+(?:\s*[-a-zA-Z])?)*|[IVXLCDM]+(?!.*\b\p{N}+\b))(?<!\s)\s*(?:(?:[,\/]|(?=#)|\s)\s*(?!\s*No\.)\s*(?<B_Addition_to_address_2>(?!\s).*?))?)\s*$/u;
for (const address of addresses) {
  const m = regex.exec(address);
  if (m) {
    // Keep only the named groups that took part in the match.
    console.log(Object.fromEntries(Object.entries(m.groups).filter(([k,v]) => v!==undefined)));
  } else {
    console.log(`No match found in "${address}"`);
  }
}

See the regex demo.

In the code, Object.fromEntries(Object.entries(m.groups).filter(([k,v]) => v!==undefined)) drops every named capturing group whose value is undefined, that is, every group that did not take part in the match (for example, all B_ groups when the first alternative matched).
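The filtering step is easier to follow on a small pattern. The sketch below wraps it in a hypothetical definedGroups helper and runs it against a toy two-alternative regex whose group names only loosely mirror the A_/B_ layout; neither the helper name nor the toy pattern comes from the original regex.

```javascript
// Hypothetical helper: keep only the named groups that took part in the match.
function definedGroups(match) {
  return Object.fromEntries(
    Object.entries(match.groups).filter(([, value]) => value !== undefined)
  );
}

// Toy pattern: number-first (A_) or street-first (B_) layout.
const toy = /^(?:(?<A_number>\d+) (?<A_street>.+)|(?<B_street>\D+) (?<B_number>\d+))$/u;

// Number-first input: only the A_ groups are defined.
const numberFirst = definedGroups(toy.exec('100 Baker Street'));
console.log(numberFirst);

// Street-first input: only the B_ groups are defined.
const streetFirst = definedGroups(toy.exec('Baker Street 100'));
console.log(streetFirst);
```

Without the filter, each result would also carry the groups of the other alternative with the value undefined.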
