I am parsing some text with Python and am running into an odd issue.
An example of the text being parsed:
msg:"ET WEB_SPECIFIC_APPS ClarkConnect Linux proxy.php XSS Attempt"; flow:established,to_server; content:"GET"; content:"script"; nocase; content:"/proxy.php?"; nocase; content:"url="; nocase; pcre:"//proxy.php(?|.[\x26\x3B])url=[^&;\x0D\x0A][<>"']/i"; reference:url,www.securityfocus.com/bid/37446/info; reference:url,doc.emergingthreats.net/2010602; classtype:web-application-attack; sid:2010602; rev:4; metadata:created_at 2010_07_30, updated_at 2010_07_30;
my regex:
msgSearch = re.search(r'msg:"(.+)";', line)
actual result:
ET WEB_SPECIFIC_APPS ClarkConnect Linux proxy.php XSS Attempt"; flow:established,to_server; content:"GET"; content:"script"; nocase; content:"/proxy.php?"; nocase; content:"url="; nocase; pcre:"//proxy.php(?|.[\x26\x3B])url=[^&;\x0D\x0A][<>"']/i
expected result:
ET WEB_SPECIFIC_APPS ClarkConnect Linux proxy.php XSS Attempt
There are tens of thousands of lines of text that I am parsing, and they all give me similar results. Is there any reason the regex is picking a (seemingly) random "; to stop at? I can fix the example above by making the regex more specific, e.g. r'msg:"([\w\s\.]+)";', but other lines contain different characters. I guess I could just include every special character in my regex, but I'm trying to understand why my wildcard isn't working properly.
Any help would be appreciated!
CodePudding user response:
Try this one:
re.search(r'msg:"([^;]+)";', line)
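As a rough sketch of how the negated character class behaves: the sample lines below are shortened, made-up stand-ins for the rule in the question, not real data. The second one illustrates a likely limitation, since [^;]+ cannot match a msg value that itself contains a semicolon.

```python
import re

# Shortened, hypothetical sample in the same shape as the rule in the question.
line = 'msg:"ET WEB Example XSS Attempt"; flow:established,to_server; content:"GET"; sid:1;'

# [^;]+ cannot run past the first semicolon, so the match ends at the first ";
match = re.search(r'msg:"([^;]+)";', line)
first_msg = match.group(1) if match else None
print(first_msg)

# Hypothetical edge case: a semicolon inside the msg value defeats this pattern.
tricky = 'msg:"part one; part two"; sid:2;'
tricky_match = re.search(r'msg:"([^;]+)";', tricky)
print(tricky_match)
```

If the msg values never contain a semicolon, this pattern should be enough; otherwise a lazy quantifier may be the safer choice.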
CodePudding user response:
The .+ is by default "greedy", i.e. it will match as many characters as possible. In your case, it will stop at the last "; sequence, not at the next one. To make it non-greedy (or lazy), try .+?:
msgSearch = re.search(r'msg:"(.+?)";', line)
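To see the difference side by side, here is a minimal sketch that runs both patterns against the same input. The sample line is a shortened, made-up rule used only for illustration, and the variable names are arbitrary.

```python
import re

# Shortened, hypothetical sample in the same shape as the rule in the question.
line = 'msg:"ET WEB Example XSS Attempt"; flow:established,to_server; content:"GET"; sid:1;'

# Greedy: .+ grabs as much as it can, then backtracks to the LAST "; in the line.
greedy = re.search(r'msg:"(.+)";', line)

# Lazy: .+? grows one character at a time and stops at the FIRST "; it reaches.
lazy = re.search(r'msg:"(.+?)";', line)

print(greedy.group(1))
print(lazy.group(1))
```

The greedy capture runs through to content:"GET, because that is followed by the final "; in the sample, while the lazy capture contains only the msg text.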