Regular Expression to match addresses: problem with matching addresses with different structures

Time:10-21

I am using regular expressions to match the different parts of street addresses (street number, street, city, ...). So far, everything works except for the city, where the result depends on the structure of the address:

Some addresses in my data end with only a city, such as "Paris", and others end with a city, a comma and a country, following the structure "Paris, France".
I have found a regular expression that matches everything but the end of the address (city and country), so I would like to match the city properly.

I cannot just match the first word, as some city names consist of more than one word (for example, Saint-Jean-Port-Joli).

Here is what I have tried to match the city:

(\\w.*,|\\w.*$)

Unfortunately, this gives me "Paris" for the addresses ending in "Paris" and "Paris," (with the trailing comma) for the addresses ending in "Paris, France".

How should I do this?

Thank you for your help, Tim

CodePudding user response:

This is pretty simple if your regex flavor supports lookaheads:

^.+?(?=(, [\w\s]+)?$)

I added the \s so that countries like Burkina Faso are parsed correctly. Note that if your string contains multiple commas, the match will include everything up to the last one.
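
As a minimal sketch of how this lookahead approach might be used, assuming Python's `re` module and assuming the input is only the tail of the address (city, optionally followed by ", country"). The names `CITY_RE` and `extract_city` are made up for illustration and are not part of the answer above:

```python
import re
from typing import Optional

# Lazily consume characters from the start and stop at the first position
# where an optional ", Country" suffix followed by end-of-string can match.
# The lookahead does not consume the suffix, so it is not part of the match.
CITY_RE = re.compile(r"^.+?(?=(?:, [\w\s]+)?$)")


def extract_city(tail: str) -> Optional[str]:
    """Return the city part of an address tail, or None if nothing matches."""
    match = CITY_RE.match(tail)
    return match.group(0) if match else None


if __name__ == "__main__":
    for tail in ["Paris", "Paris, France", "Ouagadougou, Burkina Faso"]:
        print(repr(tail), "->", repr(extract_city(tail)))
```

With several commas, for example "a, b, c", this sketch returns "a, b", which is the behaviour described in the note above.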

CodePudding user response:

Match all characters that are not commas:

^[^,]+

This matches everything up to, but not including, the first comma or to the end, whichever comes first.

This also works for city names containing various characters, e.g. L'Haÿ-les-Roses, France.
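
A short sketch of the negated character class in use, again assuming Python's `re` module and an input that holds only the end of the address. The helper name `city_before_comma` is hypothetical:

```python
import re
from typing import Optional

# One or more characters that are not a comma, anchored at the start.
FIRST_FIELD_RE = re.compile(r"^[^,]+")


def city_before_comma(tail: str) -> Optional[str]:
    """Return the text before the first comma, or the whole string if there is none."""
    match = FIRST_FIELD_RE.match(tail)
    return match.group(0) if match else None


if __name__ == "__main__":
    for tail in ["Paris", "Paris, France", "L'Haÿ-les-Roses, France"]:
        print(repr(tail), "->", repr(city_before_comma(tail)))
```

Unlike the lookahead version, this stops at the first comma, so "a, b, c" yields "a". An empty string, or one starting with a comma, gives no match.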
