You can check the regex101 page from here.
I have a list of addresses in different formats, and they are not in English. Assume my list looks like this:
KENNEDY CAD. SİRKECİ ARABALI VAPUR İSKELESİ FATİH/ İSTANBUL
YAVUZTÜRK MAH. KARADENİZ CAD. NO:2 ÜSKÜDAR/ İSTANBUL
HAMİDİYE MAH. ALPEREN SOK. NO:15/2 ÇEKMEKÖY/ İSTANBUL
UĞUR MUMCU MAH. YUNUS EMRE CAD. NO:25 KARTAL/ İSTANBUL
The regex I've written is as follows:
(?:(?:\p{L}* M[Aa]?[Hh][. ])? *|(?:\p{L}* C[Aa]?[Dd][. ])? *)
My regex returns each character as a match, but I need to get 4 matches, which are:
KENNEDY CAD.
YAVUZTÜRK MAH. KARADENİZ CAD.
HAMİDİYE MAH.
UĞUR MUMCU MAH. YUNUS EMRE CAD.
How can I solve that problem?
CodePudding user response:
You can use
^\p{L}+(?:\s+\p{L}+)*\s+(?:M[Aa]?[Hh]|C[Aa]?[Dd])\.?(?:\s+\p{L}+(?:\s+\p{L}+)*\s+(?:M[Aa]?[Hh]|C[Aa]?[Dd]))*\.?
Details:
^ - start of string
\p{L}+(?:\s+\p{L}+)* - a word and then zero or more whitespace-separated words
\s+ - one or more whitespaces
(?:M[Aa]?[Hh]|C[Aa]?[Dd]) - M, an optional A or a, and then H or h; or C, an optional A or a, and then D or d
\.? - an optional dot
(?:\s+\p{L}+(?:\s+\p{L}+)*\s+(?:M[Aa]?[Hh]|C[Aa]?[Dd]))* - zero or more sequences of one or more whitespaces and the pattern described above
\.? - an optional dot
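As a rough check of the pattern above, here is a minimal Python sketch run against the four sample addresses. It rests on one assumption that is not part of the answer: Python's built-in re module does not support \p{L}, so [^\W\d_] is used as an approximation of "any letter". The names LETTER, WORDS, SUFFIX, PREFIX_RE and match_prefix are only illustrative.

```python
import re

# Assumption: the stdlib re module has no \p{L}; [^\W\d_] (a word character
# that is neither a digit nor an underscore) approximates "any letter".
LETTER = r"[^\W\d_]"
WORDS = rf"{LETTER}+(?:\s+{LETTER}+)*"   # a word, then more whitespace-separated words
SUFFIX = r"(?:M[Aa]?[Hh]|C[Aa]?[Dd])"    # MAH / MH or CAD / CD, with case variants

PREFIX_RE = re.compile(rf"^{WORDS}\s+{SUFFIX}\.?(?:\s+{WORDS}\s+{SUFFIX})*\.?")

ADDRESSES = [
    "KENNEDY CAD. SİRKECİ ARABALI VAPUR İSKELESİ FATİH/ İSTANBUL",
    "YAVUZTÜRK MAH. KARADENİZ CAD. NO:2 ÜSKÜDAR/ İSTANBUL",
    "HAMİDİYE MAH. ALPEREN SOK. NO:15/2 ÇEKMEKÖY/ İSTANBUL",
    "UĞUR MUMCU MAH. YUNUS EMRE CAD. NO:25 KARTAL/ İSTANBUL",
]

def match_prefix(address):
    """Return the leading MAH./CAD. part of an address, or None if absent."""
    m = PREFIX_RE.match(address)
    return m.group(0) if m else None

for address in ADDRESSES:
    print(match_prefix(address))
```

With a regex engine that supports \p{L} directly, the pattern from the answer can be used unchanged.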
See the regex demo. Or, a bit less precise and efficient, but shorter:
^(?:\s*[\p{L}\s]+(?:M[Aa]?[Hh]|C[Aa]?[Dd])\.?)+
See this regex demo. Details:
^ - start of string
(?:\s*[\p{L}\s]+(?:M[Aa]?[Hh]|C[Aa]?[Dd])\.?)+ - one or more sequences of:
  \s* - zero or more whitespaces
  [\p{L}\s]+ - one or more letters or whitespaces
  (?:M[Aa]?[Hh]|C[Aa]?[Dd]) - M, an optional A or a, and then H or h; or C, an optional A or a, and then D or d
  \.? - an optional dot
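The shorter pattern can be tried the same way. This is only a sketch under the same assumption as before: Python's built-in re module lacks \p{L}, so the class [\p{L}\s] is approximated by the alternation (?:[^\W\d_]|\s). The names SHORT_RE and match_prefix_short are hypothetical.

```python
import re

# Assumption: [\p{L}\s] is approximated as (?:[^\W\d_]|\s), because the
# stdlib re module does not support \p{L}.
LETTER_OR_SPACE = r"(?:[^\W\d_]|\s)"
SUFFIX = r"(?:M[Aa]?[Hh]|C[Aa]?[Dd])"

# One or more repetitions of: optional whitespace, letters/whitespace,
# the MAH/CAD abbreviation, and an optional dot.
SHORT_RE = re.compile(rf"^(?:\s*{LETTER_OR_SPACE}+{SUFFIX}\.?)+")

def match_prefix_short(address):
    """Return the leading MAH./CAD. part of an address, or None if absent."""
    m = SHORT_RE.match(address)
    return m.group(0) if m else None

print(match_prefix_short("YAVUZTÜRK MAH. KARADENİZ CAD. NO:2 ÜSKÜDAR/ İSTANBUL"))
print(match_prefix_short("HAMİDİYE MAH. ALPEREN SOK. NO:15/2 ÇEKMEKÖY/ İSTANBUL"))
```

Because this version does not require whitespace before the abbreviation, it can also stop inside a longer word that happens to end in one of these letter sequences, which is the loss of precision mentioned above.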
CodePudding user response:
Try (regex101):
^(?=.*C[Aa][Dd]\s*\.).*?C[Aa][Dd]\.|^.*?M[Aa][Hh]\s*\.
This matches everything from the start of the string up to CAD. or, if CAD. is not found, up to MAH.
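Since this pattern does not use \p{L}, it can be tried as-is with Python's built-in re module. The following is a small sketch against the sample addresses; the names CAD_OR_MAH_RE and match_until_cad_or_mah are illustrative only.

```python
import re

# First alternative: a lookahead confirms a CAD. exists, then a lazy match
# runs up to the first CAD. Second alternative: lazy match up to MAH.
CAD_OR_MAH_RE = re.compile(
    r"^(?=.*C[Aa][Dd]\s*\.).*?C[Aa][Dd]\.|^.*?M[Aa][Hh]\s*\."
)

def match_until_cad_or_mah(address):
    """Return the text up to CAD. (or up to MAH. when there is no CAD.)."""
    m = CAD_OR_MAH_RE.match(address)
    return m.group(0) if m else None

for address in [
    "KENNEDY CAD. SİRKECİ ARABALI VAPUR İSKELESİ FATİH/ İSTANBUL",
    "YAVUZTÜRK MAH. KARADENİZ CAD. NO:2 ÜSKÜDAR/ İSTANBUL",
    "HAMİDİYE MAH. ALPEREN SOK. NO:15/2 ÇEKMEKÖY/ İSTANBUL",
    "UĞUR MUMCU MAH. YUNUS EMRE CAD. NO:25 KARTAL/ İSTANBUL",
]:
    print(match_until_cad_or_mah(address))
```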