Splitting addresses using the space after the postal code with regex in Java

Some raw rows contain two or more addresses. I want to split them after the last part of the Canadian postal code using a lookaround. The Canadian postal code format is A1A 1A1, where A is a letter and 1 is a digit, with a space separating the third and fourth characters.

Here is an example:

160 Rue, Notre Dame N, Bureau 140, Sainte-Marie, G6E 3Z9 887 Chemin du Bord de l'Eau, Saint-Henri de Levis, G0R 3E0

I want to split the row on the space after the last part of the postal code, if there is one. The expected result:

160 Rue, Notre Dame N, Bureau 140, Sainte-Marie, G6E 3Z9
887 Chemin du Bord de l'Eau, Saint-Henri de Levis, G0R 3E0

I tried:

List<String>  addresses = new ArrayList<String>();
addresses =  Arrays.asList(long_addresses.Address.split("(\\d\\w\\d)\\s"));

But the result is:

[, Rue, Notre Dame N, Bureau 140, Sainte-Marie, G6E , , Chemin du Bord de l'Eau, Saint-Henri de Levis, G0R 3E0]

Here are some other examples:

141 rang du brûlé, pont rouge, G3H1B8 200 rue Commerciale, Donnacona, G3M 1W1

33 rue provost, Montreal, H8S 1L3 46 avenue Sainte-Anne, Pointe-Claire, H9S 4P8 2035 rue Victoria, Lachine, H8S 0A8 2075 rue de l'Eglise, Saint-Laurent, H4M 1G3 800 Pl Leigh-Capreol, Dorval, Montréal, H4Y 0A4

2075 rue de l'Eglise, Saint-Laurent, H4M 1G3 2035 rue Victoria, Lachine, H8S 0A8 46 ave, Sainte-Anne, Pointe-Claire, H9S 4P8 12 Charlevoix , Kirkland, H9J 2T6 930 St Germain St, Ville St-Laurent, H4L 3R9 1417 argyle , Montreal, H3G 1V5

Note: I trim each row, so the last postal code is not followed by a space. Thank you in advance.

CodePudding user response：

You can use

(?<=\b[a-zA-Z]\d[a-zA-Z]\s\d[a-zA-Z]\d)\s

Or, if the space between the A1A and 1A1 is optional, and can go missing, you can use

(?<=\b[a-zA-Z]\d[a-zA-Z]\s{0,1}\d[a-zA-Z]\d)\s

This still works because the Java regex engine supports lookbehind patterns of bounded (constrained) width, and \s{0,1} can match at most one character.

Details:

(?<=\b[a-zA-Z]\d[a-zA-Z]\s\d[a-zA-Z]\d) - a positive lookbehind that requires (immediately to the left of the current location):
- \b - a word boundary
- [a-zA-Z] - a letter
- \d - a digit
- [a-zA-Z]\s\d[a-zA-Z]\d - a letter, a whitespace, digit, letter and a digit
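\s - a single whitespace character; this is the only text consumed, so it is where the string is split.

In the second pattern, \s{0,1} matches zero or one whitespace character (it is equivalent to \s?).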

Java demo:

String s = "160 Rue, Notre Dame N, Bureau 140, Sainte-Marie, G6E 3Z9 887 Chemin du Bord de l'Eau, Saint-Henri de Levis, G0R 3E0";
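// Split on the single whitespace that follows a full postal code (A1A 1A1).
String regex = "(?<=\\b[a-zA-Z]\\d[a-zA-Z]\\s\\d[a-zA-Z]\\d)\\s";
// Or, if the space inside the postal code may be missing:
// String regex = "(?<=\\b[a-zA-Z]\\d[a-zA-Z]\\s{0,1}\\d[a-zA-Z]\\d)\\s";
String[] results = s.split(regex);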
for (String str: results) {
    System.out.println(str);
}

Output:

160 Rue, Notre Dame N, Bureau 140, Sainte-Marie, G6E 3Z9
887 Chemin du Bord de l'Eau, Saint-Henri de Levis, G0R 3E0
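
For rows like the other examples in the question, where the space inside a postal code may be missing (G3H1B8), the optional-space variant can be wrapped in a small helper. The following is only a sketch: the class name AddressSplitter and the method splitAddresses are hypothetical, \s? is used as a shorthand for \s{0,1}, and the trailing \s+ (instead of a single \s) is an added assumption so that runs of several spaces between addresses are consumed as well. The sample row drops the accents from the question's data only to keep the source ASCII.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class AddressSplitter {
    // Lookbehind: a postal code (A1A 1A1 or A1A1A1) must end right before
    // the split point. Only the whitespace after it is consumed.
    // Assumption: \s+ is used so that multiple spaces are treated as one separator.
    private static final Pattern AFTER_POSTAL_CODE =
            Pattern.compile("(?<=\\b[a-zA-Z]\\d[a-zA-Z]\\s?\\d[a-zA-Z]\\d)\\s+");

    public static List<String> splitAddresses(String row) {
        // trim() mirrors the question's note about trimming each row.
        return Arrays.asList(AFTER_POSTAL_CODE.split(row.trim()));
    }

    public static void main(String[] args) {
        String row = "141 rang du brule, pont rouge, G3H1B8 "
                + "200 rue Commerciale, Donnacona, G3M 1W1";
        for (String address : splitAddresses(row)) {
            System.out.println(address);
        }
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex for every row, which String.split would otherwise do on each call.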