I'm attempting to port Python code to PHP, and I can't seem to convert some of the regexes to their PHP equivalents.
RE_ON_DATE_SMB_WROTE = re.compile(
u'(-*[>]?[ ]?({0})[ ].*({1})(.*\n){{0,2}}.*({2}):?-*)'.format(
# Beginning of the line
u'|'.join((
# English
'On',
# French
'Le',
# Polish
'W dniu',
# Dutch
'Op',
# German
'Am',
# Portuguese
'Em',
# Norwegian
u'På',
# Swedish, Danish
'Den',
# Vietnamese
u'Vào',
)),
# Date and sender separator
u'|'.join((
',',
u'użytkownik'
)),
# Ending of the line
u'|'.join((
# English
'wrote', 'sent',
# French
u'a écrit',
# Polish
u'napisał',
# Dutch
'schreef','verzond','geschreven',
# German
'schrieb',
# Portuguese
'escreveu',
# Norwegian, Swedish
'skrev',
# Vietnamese
u'đã viết',
))
))
RE_QUOTATION = re.compile(
r"""
(
(?:
s
|
(?:me*){2,}
)
.*
me*
)
[te]*$
""", re.VERBOSE)
RE_EMPTY_QUOTATION = re.compile(
r"""
(
(?:
(?:se*)
|
(?:me*){2,}
)
)
e*
""", re.VERBOSE)
Below is my attempt at the first regex in PHP (but it fails on the same test string):
$RE_ON_DATE_SMB_WROTE = sprintf("#(-*[>]?[ ]?(%s)[ ].*(%s)(.*\n){{0,2}}.*(%s):?-*)#u",
join('|', array(
// English
'On',
// French
'Le',
// Polish
'W dniu',
// Dutch
'Op',
// German
'Am',
// Portuguese
'Em',
// Norwegian
"\p{P}\p{å}",
// Swedish, Danish
'Den',
// Vietnamese
"Vào",
)),
join('|',array(
',',
"użytkownik"
)),
join('|',array(
//# English
'wrote',
'sent',
//# French
"a écrit",
//# Polish
"napisał",
//# Dutch
'schreef','verzond','geschreven',
//# German
'schrieb',
//# Portuguese
'escreveu',
//# Norwegian, Swedish
'skrev',
//# Vietnamese
"đã viết",
))
);
I run the PHP regex as
$result = preg_match($RE_ON_DATE_SMB_WROTE , $test_value, $found_match);
As for the last two regexes, I can't even seem to wrap my head around them.
Hopefully someone more versed than me in both Python and PHP can give me a hand here. :)
Answer:
You build the regex with a format string in both Python and PHP, but the two use different placeholder syntax. In Python's str.format, {} (or {0}, {1}, ...) inserts an argument into the format string, while in PHP's sprintf you use %s to insert a string argument. Hence, { and } are special in Python format strings and must be doubled when you want a literal brace char. No such doubling is required in PHP, so the {{0,2}} in your PHP pattern has to become {0,2}.
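To make the brace difference concrete, here is a small Python sketch. It uses Python's own printf-style `%` operator as a stand-in for PHP's `sprintf` (an assumption made only so both styles can run side by side), and the shortened template and the `'wrote|sent'` alternation are illustrative, not the full pattern from the question.

```python
import re

# str.format treats {0} as a placeholder, so a literal regex
# quantifier such as {0,2} has to be written with doubled braces.
format_template = r'(.*\n){{0,2}}({0})'
format_pattern = format_template.format('wrote|sent')

# printf-style formatting (the style PHP's sprintf uses) leaves braces
# alone, so the quantifier is written once.
printf_template = r'(.*\n){0,2}(%s)'
printf_pattern = printf_template % 'wrote|sent'

# Copying the doubled braces into a printf-style template keeps them
# doubled in the final pattern, which is not the intended quantifier.
unconverted_pattern = r'(.*\n){{0,2}}(%s)' % 'wrote|sent'

print(format_pattern)
print(printf_pattern)
print(unconverted_pattern)

# The two correctly written templates produce the same regex.
assert format_pattern == printf_pattern
assert re.search(format_pattern, "a\nb\nwrote") is not None
```

The third pattern is what the PHP attempt in the question effectively builds: the doubled braces survive into the regex instead of collapsing to a single quantifier.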
Also, you have "\p{P}\p{å}" in the PHP regex declaration, while in Python you just have u'På' (and \p{å} is not a valid Unicode property name). I guess you want to keep the Python pattern as is.
So, here is the PHP pattern that will work the same as the Python one:
$RE_ON_DATE_SMB_WROTE = sprintf("#(-*>? ?(%s) .*(%s)(.*\\n){0,2}.*(%s):?-*)#u",
implode('|', array(
// English
'On',
// French
'Le',
// Polish
'W dniu',
// Dutch
'Op',
// German
'Am',
// Portuguese
'Em',
// Norwegian
"På",
// Swedish, Danish
'Den',
// Vietnamese
"Vào",
)),
implode('|',array(
',',
"użytkownik"
)),
implode('|',array(
//# English
'wrote',
'sent',
//# French
"a écrit",
//# Polish
"napisał",
//# Dutch
'schreef','verzond','geschreven',
//# German
'schrieb',
//# Portuguese
'escreveu',
//# Norwegian, Swedish
'skrev',
//# Vietnamese
"đã viết",
))
);
join is an alias of implode; I prefer implode in this context.
Note that [ ] is the same as a literal space here, [>] = >, and "\n" (a string escape sequence that puts an LF char into the pattern) = "\\n" (a regex escape sequence matching an LF char).
Note that if you want to port the re.VERBOSE flag to PHP, you will need to use the x flag, and then you cannot use literal whitespace inside the pattern: you will need to escape the literal whitespace or put it into a character class (yes, [ ] will make sense then).
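The whitespace rule is easy to check on the Python side, where re.VERBOSE treats unescaped whitespace the same way as PCRE's x flag does. The following is a minimal sketch; the sample text is made up for illustration.

```python
import re

text = "W dniu 2020-01-01"  # made-up sample input

# Without VERBOSE, the space in the pattern is a literal space.
plain = re.compile(r"W dniu")

# With VERBOSE, the unescaped space is ignored, so this pattern
# is effectively "Wdniu" and no longer matches the text.
verbose_broken = re.compile(r"W dniu", re.VERBOSE)

# Escaping the space or wrapping it in a character class keeps it literal.
verbose_escaped = re.compile(r"W\ dniu", re.VERBOSE)
verbose_class = re.compile(r"W[ ]dniu", re.VERBOSE)

for name, regex in [
    ("plain", plain),
    ("verbose_broken", verbose_broken),
    ("verbose_escaped", verbose_escaped),
    ("verbose_class", verbose_class),
]:
    print(name, bool(regex.search(text)))
```

This matters for the first regex in particular, because alternatives such as 'W dniu', 'a écrit' and 'đã viết' contain spaces that would be dropped silently under a verbose flag.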
The last two regexes do not need any special conversion. Once the whitespace that re.VERBOSE ignores is removed, they can be written as
$RE_QUOTATION = '~((?:s|(?:me*){2,}).*me*)[te]*$~';
$RE_EMPTY_QUOTATION = '~((?:(?:se*)|(?:me*){2,}))e*~';
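As a sanity check before porting, one can confirm in Python that the single-line forms behave like the verbose originals. This is only a sketch: the marker strings in `samples` are invented for the comparison, and the helper `match_span` is a hypothetical name.

```python
import re

verbose_quotation = re.compile(r"""
    (
        (?:
            s
            |
            (?:me*){2,}
        )
        .*
        me*
    )
    [te]*$
    """, re.VERBOSE)
compact_quotation = re.compile(r"((?:s|(?:me*){2,}).*me*)[te]*$")

verbose_empty = re.compile(r"""
    (
        (?:
            (?:se*)
            |
            (?:me*){2,}
        )
    )
    e*
    """, re.VERBOSE)
compact_empty = re.compile(r"((?:(?:se*)|(?:me*){2,}))e*")

def match_span(regex, text):
    """Return the (start, end) of the first match, or None."""
    match = regex.search(text)
    return match.span() if match else None

# Invented marker strings, used only to compare the two spellings.
samples = ["tsemmte", "mmmet", "see", "ttt", "tmem"]

for sample in samples:
    assert match_span(verbose_quotation, sample) == match_span(compact_quotation, sample)
    assert match_span(verbose_empty, sample) == match_span(compact_empty, sample)

# A space left behind while flattening becomes a literal space
# and changes what the pattern matches.
stray_space = re.compile(r"((?:(?:se*) |(?:me*){2,}))e*")
print(match_span(compact_empty, "see"), match_span(stray_space, "see"))
```

The compact strings can then be dropped between PHP delimiters unchanged, since they use no syntax that differs between Python's re module and PCRE.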