I am trying to match the string "06 März 2021" with the regex:
r"(\d{1,2})\W(\p{L}{3,20})\W(\d{4})"
I tried telling the Regex to use Unicode:
RegExp(datePattern, unicode: true);
But that doesn't work for ä. It does for some other accented characters though.
Help would be appreciated. Thanks.
CodePudding user response:
Debugging showed me that the ä is treated as 2 characters: an a followed by the combining umlaut mark (U+0308).
You can see this because the following 2 strings are not identical (unless Stack Overflow messes with the text I type):
März
März
In the first case the ä is composed from 2 characters, the a and the umlaut. In the 2nd, it's a single character. This can be checked by printing the lengths of the 2 strings (first is 5, second is 4).
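A minimal sketch of that length check, in case the two strings above got normalized when pasted. The names `decomposed` and `composed` are only illustrative, and the escapes `\u0308` (combining diaeresis) and `\u00E4` (precomposed ä) are used so the difference survives copy and paste:

```dart
// Illustrative names, not from the original post.
// "a" followed by the combining diaeresis U+0308 (two code units for ä).
const decomposed = 'Ma\u0308rz';
// Precomposed ä, U+00E4 (one code unit).
const composed = 'M\u00E4rz';

void main() {
  // String.length counts UTF-16 code units, not visible characters.
  print(decomposed.length); // 5
  print(composed.length); // 4
  // The strings render the same but do not compare equal.
  print(decomposed == composed); // false
}
```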
After finding this link: https://www.regular-expressions.info/unicode.html#category
I realized that I needed to add the mark class of characters to the regex, so what I ended up with is:
r"(\d{1,2})\s([\p{L}\p{M}]{3,20})\s(\d{4})"
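To compare the two patterns side by side, here is a small sketch. The variable names and the two test strings are my own assumptions; the patterns are the ones from the question and from this answer:

```dart
// Pattern from the question: letters only, so a combining mark ends the run.
final lettersOnly =
    RegExp(r'(\d{1,2})\W(\p{L}{3,20})\W(\d{4})', unicode: true);

// Pattern from this answer: letters plus combining marks (\p{M}).
final lettersAndMarks =
    RegExp(r'(\d{1,2})\s([\p{L}\p{M}]{3,20})\s(\d{4})', unicode: true);

// Assumed test inputs: the same date with a decomposed and a precomposed ä.
const decomposedDate = '06 Ma\u0308rz 2021';
const composedDate = '06 M\u00E4rz 2021';

void main() {
  print(lettersOnly.hasMatch(decomposedDate)); // false
  print(lettersOnly.hasMatch(composedDate)); // true
  print(lettersAndMarks.hasMatch(decomposedDate)); // true
  print(lettersAndMarks.hasMatch(composedDate)); // true

  final match = lettersAndMarks.firstMatch(decomposedDate)!;
  print(match.group(1)); // 06
  print(match.group(3)); // 2021
}
```

Keep in mind that with this pattern each combining mark counts toward the {3,20} limit, so the month group is measured in code points, not in visible letters.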
An alternative would be applying canonical decomposition followed by canonical composition (NFC normalization) to the string before matching, using https://pub.dev/packages/unorm_dart
This would turn the first string into the second (a single character for ä instead of 2).
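A rough sketch of the normalization approach, assuming unorm_dart is listed as a dependency in pubspec.yaml and exposes `nfc` as shown in its README. The helper `matchDate` is a hypothetical name of mine, not part of the package:

```dart
import 'package:unorm_dart/unorm_dart.dart' as unorm;

// With NFC applied first, the letters-only class \p{L} is enough.
final datePattern =
    RegExp(r'(\d{1,2})\s(\p{L}{3,20})\s(\d{4})', unicode: true);

/// Hypothetical helper: normalize the input to NFC, then match it.
RegExpMatch? matchDate(String input) =>
    datePattern.firstMatch(unorm.nfc(input));

void main() {
  // Decomposed input: "a" followed by the combining diaeresis U+0308.
  const decomposed = '06 Ma\u0308rz 2021';
  final match = matchDate(decomposed);
  // After NFC the month uses the precomposed ä, so it has 4 code units.
  print(match?.group(2)?.length); // 4
}
```

The upside of normalizing first is that the captured month is in a predictable form for later lookups. The downside is the extra dependency.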
NOTE: I only tested this with letters with umlauts. Since \p{M} matches any combining mark, it should also work for other accented letters that arrive in decomposed form, but I have not verified that.
edit: replaced \W in the regex with \s so it only matches space characters (as suggested by The fourth bird)