Python regex not returning as expected

Time:07-28

I am parsing some text with Python and am running into an odd issue...

an example text that is being parsed:

msg:"ET WEB_SPECIFIC_APPS ClarkConnect Linux proxy.php XSS Attempt"; flow:established,to_server; content:"GET"; content:"script"; nocase; content:"/proxy.php?"; nocase; content:"url="; nocase; pcre:"//proxy.php(?|.[\x26\x3B])url=[^&;\x0D\x0A][<>"']/i"; reference:url,www.securityfocus.com/bid/37446/info; reference:url,doc.emergingthreats.net/2010602; classtype:web-application-attack; sid:2010602; rev:4; metadata:created_at 2010_07_30, updated_at 2010_07_30;

my regex:

msgSearch = re.search(r'msg:"(.+)";', line)

actual result:

ET WEB_SPECIFIC_APPS ClarkConnect Linux proxy.php XSS Attempt"; flow:established,to_server; content:"GET"; content:"script"; nocase; content:"/proxy.php?"; nocase; content:"url="; nocase; pcre:"//proxy.php(?|.[\x26\x3B])url=[^&;\x0D\x0A][<>"']/i

expected result:

ET WEB_SPECIFIC_APPS ClarkConnect Linux proxy.php XSS Attempt

There are tens of thousands of lines of text that I am parsing, and they all give me similar results. Any reason the regex is picking a (seemingly) random "; to stop at? I can fix the example above by making the regex more specific, e.g. r'msg:"([\w\s\.]+)";', but other lines have different characters included. I guess I could just include every special character in my regex, but I'm trying to understand why my wildcard isn't working properly.

Any help would be appreciated!

CodePudding user response:

Try this one:

re.search(r'msg:"([^;]+)";', line)
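
A minimal sketch of how the negated character class behaves, in case it helps. The helper name extract_msg and the shortened sample lines are made up for illustration; they are not from the question. One caveat: because [^;] refuses to cross a semicolon, a message that itself contains a semicolon will not match at all.

```python
import re

def extract_msg(line):
    """Return the text inside msg:"..."; or None if nothing matches."""
    # [^;]+ cannot run past a semicolon, so the capture stops at the first ";
    match = re.search(r'msg:"([^;]+)";', line)
    return match.group(1) if match else None

# Hypothetical, shortened rule line modeled on the example in the question.
sample = 'msg:"ET WEB_SPECIFIC_APPS XSS Attempt"; flow:established,to_server; content:"GET";'
print(extract_msg(sample))

# A semicolon inside the message defeats this pattern: no match is returned.
print(extract_msg('msg:"a; b"; sid:1;'))
```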

CodePudding user response:

The .+ is greedy by default, i.e. it matches as many characters as possible. In your case, the match runs to the last "; sequence on the line, not the first one after the message. To make it non-greedy (or lazy), use .+? :

 msgSearch = re.search(r'msg:"(.+?)";', line)
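
To see the difference side by side, here is a small sketch that runs both patterns on the same input. The line used is a shortened, made-up version of the rule from the question, so treat it as an illustration rather than real data.

```python
import re

# Hypothetical, shortened rule line modeled on the example in the question.
line = 'msg:"ET WEB_SPECIFIC_APPS XSS Attempt"; flow:established,to_server; content:"GET"; sid:2010602;'

# Greedy: .+ grabs as much as it can, then backs off to the LAST "; on the line.
greedy = re.search(r'msg:"(.+)";', line)

# Lazy: .+? grabs as little as it can, so it stops at the FIRST "; it meets.
lazy = re.search(r'msg:"(.+?)";', line)

print(greedy.group(1))
print(lazy.group(1))
```

The greedy capture ends at content:"GET because that is the last place a "; follows, while the lazy capture is just the message text.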