Need PHP regex equivalent for this python code

Time:03-07

I'm porting some Python code to PHP and can't work out the PHP equivalents of a few regexes.

RE_ON_DATE_SMB_WROTE = re.compile(
    u'(-*[>]?[ ]?({0})[ ].*({1})(.*\n){{0,2}}.*({2}):?-*)'.format(
        # Beginning of the line
        u'|'.join((
            # English
            'On',
            # French
            'Le',
            # Polish
            'W dniu',
            # Dutch
            'Op',
            # German
            'Am',
            # Portuguese
            'Em',
            # Norwegian
            u'På',
            # Swedish, Danish
            'Den',
            # Vietnamese
            u'Vào',
        )),
        # Date and sender separator
        u'|'.join((
            ',',
            u'użytkownik'
        )),
        # Ending of the line
        u'|'.join((
            # English
            'wrote', 'sent',
            # French
            u'a écrit',
            # Polish
            u'napisał',
            # Dutch
            'schreef','verzond','geschreven',
            # German
            'schrieb',
            # Portuguese
            'escreveu',
            # Norwegian, Swedish
            'skrev',
            # Vietnamese
            u'đã viết',
        ))
    ))
RE_QUOTATION = re.compile(
    r"""
    (
        (?:
            s
            |
            (?:me*){2,}
        )

        .*

        me*
    )

    [te]*$
    """, re.VERBOSE)
RE_EMPTY_QUOTATION = re.compile(
    r"""
    (
        (?:
            (?:se*) 
            |
            (?:me*){2,}
        )
    )
    e*
    """, re.VERBOSE)

Below is my attempt at the first regex in PHP, but it fails on the same test string:

$RE_ON_DATE_SMB_WROTE = sprintf("#(-*[>]?[ ]?(%s)[ ].*(%s)(.*\n){{0,2}}.*(%s):?-*)#u",
                            join('|', array(
                                // English
                                'On',
                                // French
                                'Le',
                                // Polish
                                'W dniu',
                                // Dutch
                                'Op',
                                // German
                                'Am',
                                // Portuguese
                                'Em',
                                // Norwegian
                                "\p{P}\p{å}",
                                // Swedish, Danish
                                'Den',
                                // Vietnamese
                                "Vào",
                            )),
                            join('|',array(
                                ',',
                                "użytkownik"
                            )),
                            join('|',array(
                                //# English
                                'wrote', 
                                'sent',
                                //# French
                                "a écrit",
                                //# Polish
                                "napisał",
                                //# Dutch
                                'schreef','verzond','geschreven',
                                //# German
                                'schrieb',
                                //# Portuguese
                                'escreveu',
                                //# Norwegian, Swedish
                                'skrev',
                                //# Vietnamese
                                "đã viết",
                            ))
                        );  

I run the PHP regex with $result = preg_match($RE_ON_DATE_SMB_WROTE, $test_value, $found_match);

As for the last two regexes, I can't even wrap my head around them.

Hopefully someone more versed than me in both Python and PHP can give me a hand here. :)

CodePudding user response:

You build the regex with a format string in both Python and PHP, but the two use different syntax. In Python, str.format() uses {} as the placeholder, so { and } are special and must be doubled to get a literal brace; that is why the Python source writes the quantifier as {{0,2}}. In PHP, sprintf uses %s as the placeholder and braces have no special meaning, so no doubling is needed. Copied as is, {{0,2}} reaches the regex engine with the braces still doubled and is no longer read as a {0,2} quantifier on the group, which is why your pattern fails.
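The brace rule is easy to check on the Python side. This is a minimal sketch with a shortened, illustrative pattern (the names template, formatted and printf_style are not from the original code); printf-style % formatting stands in for what sprintf does with the same text.

```python
# Shortened, illustrative pattern: only the quantifier matters here.
template = u'(.*\n){{0,2}}.*({0})'

# str.format() collapses each doubled brace into one literal brace
# and replaces {0} with the argument.
formatted = template.format('wrote|sent')
print(repr(formatted))

# printf-style formatting gives braces no special meaning,
# so doubled braces stay doubled in the result.
printf_style = u'(.*\n){{0,2}}.*(%s)' % 'wrote|sent'
print(repr(printf_style))
```

The first result contains the intended {0,2} quantifier; the second still contains {{0,2}}, which is the form the PHP attempt ends up passing to preg_match.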

Also, your PHP pattern has "\p{P}\p{å}" where the Python code simply has u'På'. \p{P} matches any punctuation character, not the letter P, and \p{å} is not a valid Unicode property name, so keep the literal "På" as in the Python pattern.

So, here is a PHP pattern that works the same as the Python one:

$RE_ON_DATE_SMB_WROTE = sprintf("#(-*>? ?(%s) .*(%s)(.*\\n){0,2}.*(%s):?-*)#u",
                            implode('|', array(
                                // English
                                'On',
                                // French
                                'Le',
                                // Polish
                                'W dniu',
                                // Dutch
                                'Op',
                                // German
                                'Am',
                                // Portuguese
                                'Em',
                                // Norwegian
                                "På",
                                // Swedish, Danish
                                'Den',
                                // Vietnamese
                                "Vào",
                            )),
                            implode('|',array(
                                ',',
                                "użytkownik"
                            )),
                            implode('|',array(
                                //# English
                                'wrote', 
                                'sent',
                                //# French
                                "a écrit",
                                //# Polish
                                "napisał",
                                //# Dutch
                                'schreef','verzond','geschreven',
                                //# German
                                'schrieb',
                                //# Portuguese
                                'escreveu',
                                //# Norwegian, Swedish
                                'skrev',
                                //# Vietnamese
                                "đã viết",
                            ))
                        );

join is an alias of implode; I prefer implode in this context.

Note that [ ] is the same as a literal space here, [>] is the same as >, and "\n" (a string escape sequence that puts an actual LF char into the pattern) matches the same text as "\\n" (a regex escape sequence matching an LF char).
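These equivalences hold in Python's re module as well, so they can be checked there before porting. A small sketch, using a made-up sample string and illustrative variable names:

```python
import re

# A character class with a single member matches the same as the bare character.
bracketed = re.compile(u'-*[>]?[ ]?On[ ]')
plain = re.compile(u'-*>? ?On ')

# "\n" in a normal string is a real LF inside the pattern; "\\n" passes the
# two characters \n to the regex engine, which also matches an LF.
lf_in_string = re.compile(u'(.*\n){0,2}end')
lf_in_regex = re.compile(u'(.*\\n){0,2}end')

sample = u'> On Mon, Jan 1\nfoo\nend'  # made-up test input

print(bracketed.match(sample).group(0) == plain.match(sample).group(0))
print(lf_in_string.search(sample).group(0) == lf_in_regex.search(sample).group(0))
```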

If you want to port the re.VERBOSE flag to PHP, use the x flag. With it, literal whitespace inside the pattern is ignored, so any space that must match has to be escaped or put into a character class (yes, [ ] will make sense then).
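Python's re.VERBOSE treats whitespace the same way, which makes the effect easy to see. A minimal sketch using one of the alternatives from the first regex; the variable names are illustrative:

```python
import re

# In verbose mode the space between "W" and "dniu" is dropped,
# so this pattern is effectively "Wdniu".
broken = re.compile(r"W dniu", re.VERBOSE)

# Escaping the space, or wrapping it in a character class, keeps it.
escaped = re.compile(r"W\ dniu", re.VERBOSE)
classed = re.compile(r"W[ ]dniu", re.VERBOSE)

text = "W dniu 2020-01-01"  # made-up test input
print(broken.match(text))
print(escaped.match(text).group(0))
print(classed.match(text).group(0))
```

So if the first regex were ever rewritten in verbose form, alternatives such as 'W dniu', "a écrit" and "đã viết" would need their spaces protected in one of these two ways.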

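The last two regexes do not need any special conversion. Because re.VERBOSE ignores all whitespace in the pattern (including the space after (?:se*)), they can be written on one line without it as

$RE_QUOTATION = '~((?:s|(?:me*){2,}).*me*)[te]*$~';
$RE_EMPTY_QUOTATION = '~((?:(?:se*)|(?:me*){2,}))e*~';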
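One way to sanity-check a one-line conversion is to compare it with the verbose original in Python, where both forms can be compiled side by side. The sketch below does that with all insignificant whitespace stripped from the compact forms; the sample strings are made up for illustration, and the helper same_match is hypothetical.

```python
import re

RE_QUOTATION_VERBOSE = re.compile(r"""
    (
        (?:
            s
            |
            (?:me*){2,}
        )
        .*
        me*
    )
    [te]*$
    """, re.VERBOSE)
RE_QUOTATION_COMPACT = re.compile(r"((?:s|(?:me*){2,}).*me*)[te]*$")

RE_EMPTY_VERBOSE = re.compile(r"""
    (
        (?:
            (?:se*)
            |
            (?:me*){2,}
        )
    )
    e*
    """, re.VERBOSE)
RE_EMPTY_COMPACT = re.compile(r"((?:(?:se*)|(?:me*){2,}))e*")


def same_match(text):
    """Return True if each verbose/compact pair matches the same span of text."""
    pairs = (
        (RE_QUOTATION_VERBOSE, RE_QUOTATION_COMPACT),
        (RE_EMPTY_VERBOSE, RE_EMPTY_COMPACT),
    )
    for verbose, compact in pairs:
        v, c = verbose.search(text), compact.search(text)
        # Compare spans, treating "no match" (None) as its own outcome.
        if (v and v.span()) != (c and c.span()):
            return False
    return True


samples = ["tsmmet", "tsemm", "tmemet", "ttt"]  # made-up marker strings
print(all(same_match(s) for s in samples))
```

The compact strings can then be dropped between PHP delimiters unchanged, since none of the syntax used here differs between Python's re and PCRE.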