PowerShell multiline comment regex works in regex101 but not in pandas or re

I have this regex defined in Python:

multiline_comment_regex = r'(^[ \t]*<#[^>]*#>[ \t]*[\n]*)'

And the testing string:

characters = 'asdfasdfñáéíóú\n\r\t  <#somecomment \n\r multiline\t\n\r\t asd#>\nasdf\n  #comment'

In regex101.com, the regex works like a charm and matches:

'\t  <#somecomment \n\r multiline\t\n\r\t asd#>\n'

But, using pandas, it doesn't match anything:

import pandas as pd
data = pd.DataFrame({'process': [characters, ]})
data['process'].replace({multiline_comment_regex: ''}, regex=True, inplace=True)

It doesn't match with re either:

re.match(multiline_comment_regex, characters)

What is wrong with the regex?

Thank you!

CodePudding user response：

You need to account for two things here:

By default, ^ matches only at the start of the whole string. To make it match at the start of every line you need the multiline flag, and an inline (?m) option is convenient here.
The line endings include carriage returns (\r) as well as \n, so [ \t]* after ^ stops at the \r and \n alone is not enough. It makes sense to match any whitespace with \s* at the start and end of the pattern.
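The two points above can be checked one at a time with a minimal sketch. It uses only the standard re module and the test string and pattern from the question; the names no_flag, with_flag and relaxed are illustrative and not part of the original post.

```python
import re

# Test string and original pattern, copied from the question.
characters = 'asdfasdfñáéíóú\n\r\t  <#somecomment \n\r multiline\t\n\r\t asd#>\nasdf\n  #comment'
original = r'(^[ \t]*<#[^>]*#>[ \t]*[\n]*)'

# 1. Without re.MULTILINE, ^ only matches at index 0, where the text is 'asdf...'.
no_flag = re.search(original, characters)

# 2. With re.MULTILINE, ^ also matches right after each '\n', but the comment
#    line starts with '\r', which [ \t]* cannot consume.
with_flag = re.search(original, characters, flags=re.MULTILINE)

# 3. Allowing any whitespace (\s*) after ^ lets the pattern skip the '\r'.
relaxed = re.search(r'^\s*<#[^>]*#>', characters, flags=re.MULTILINE)

print(no_flag, with_flag, relaxed is not None)
```

Only the third search finds the comment, which suggests that both changes are needed for this particular string.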

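The following pattern should work with the pandas replace call and with re.sub or re.search (re.match only tries the very start of the string, even in multiline mode, so it still returns None for this input):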

(?m)^(\s*<#[^>]*#>\s*)

See a regex test in the environment with CRLF line endings.
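As a rough check of the suggested pattern outside regex101, here is a minimal sketch using only the standard re module and the test string from the question. The variable names fixed and cleaned are illustrative.

```python
import re

# Test string copied from the question.
characters = 'asdfasdfñáéíóú\n\r\t  <#somecomment \n\r multiline\t\n\r\t asd#>\nasdf\n  #comment'

# Pattern suggested in the answer: inline multiline flag plus \s* on both sides.
fixed = r'(?m)^(\s*<#[^>]*#>\s*)'

# re.sub scans the whole string, so it finds and removes the comment block.
cleaned = re.sub(fixed, '', characters)
print(repr(cleaned))

# re.match only tries position 0, so it returns None even with (?m);
# re.search is the call that scans forward through the string.
print(re.match(fixed, characters))
print(repr(re.search(fixed, characters).group(1)))
```

The pandas call from the question takes the same pattern string. It is assumed here to substitute within each cell the way re.sub does, which this sketch does not verify. Note also that the trailing \s* removes any indentation at the start of the line that follows the comment.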