I need to improve my Powershell Regular Expression to find Java codes for specific System.out.printl

Time:06-03

We are scanning a large library of HTML, XML, and Java files, any of which can contain Java System.out.println calls. I need to find only a specific set of those calls.

Example 1: System.out.println("my job code is: " + var.jobcode);

Example 2: System.out.println("my jc is: " + var.jc);

Example 3: System.out.println("my jbc is: " + var.jbc);

I have tried to get this with the following:

Get-ChildItem C:\my\folder\path -Recurse | Where-Object FullName -Match ".*C:\\my\\folder\\path*" | Where-Object FullName -Match ".*." | Select-String -Pattern '(System\.out\.println+(.*?job)\/?[^)]+[)]\s*;)|(System\.out\.println+(.*?jc)\/?[^)]+[)]\s*;)|(System\.out\.println+(.*?jbc)\/?[^)]+[)]\s*;){99}' -List | Select Path,Line

I get the files I want, but I also get false positives: files containing lines like the following show up in the results by mistake.

System.out.println ("component printout: item"); System.out.println ("");                 <td style="word-break: break-all;word-wrap:break-word;font-size:12px;"  align="left">Job Codes</td><td style="word-break: break-all;word-wrap:break-word;font-size:12px;"  align="left">

So any time a file has a System.out.println(...) call followed anywhere later on the line by the word "job", that file gets picked up too when it shouldn't.

I have to run this over several thousand files on a semi-regular basis and need to output the file path/name and line the offending code is in.

How can I clean up this regex so that it matches only lines like my examples above and does not pick up the other files?

Answer:

Some notes about the pattern that you tried:

  • You have 3 alternatives where the only difference is the word that should be present. You can use a single pattern with an alternation for those words in a non-capturing group instead.
  • Using println+ matches printl followed by one or more n characters.
  • The non-greedy dot .*? can overmatch, as the dot can also match " and ).
  • The quantifier {99} applies only to the group in the last alternative and requires it to repeat exactly 99 times, which looks unintended.
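
The overmatching described in the notes above can be reproduced in a PowerShell session. This is a minimal sketch: $loose is a simplified stand-in for the lazy-dot part of the original expression, not the full pattern, and $tight is an illustrative tightening, not the final answer. The sample line is shortened from the false positive in the question.

```powershell
# Simplified stand-in for the lazy-dot part of the original pattern (assumption, not the full expression).
$loose = 'System\.out\.println.*?job'

# Illustrative tightening: after the opening quote, only allow characters that are not a quote.
$tight = 'System\.out\.println\s*\("[^"]*job'

$falsePositive = 'System.out.println ("component printout: item"); System.out.println (""); <td align="left">Job Codes</td>'
$wanted        = 'System.out.println("my job code is: " + var.jobcode);'

# -match is case-insensitive by default, so "Job" in the HTML cell satisfies "job".
$falsePositive -match $loose   # True: .*? runs past the quotes, parentheses and semicolons
$falsePositive -match $tight   # False: [^"]* cannot leave the string literal
$wanted -match $tight          # True
```

Restricting what may appear between println( and the keyword is what removes the false positive; the more specific pattern in this answer applies the same idea.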

You might make the pattern a bit more specific:

System\.out\.println\("[^":]*\s(?:job|jb?c)\s[^":]*:[^"]*"[^)]*\);

Explanation

  • System\.out\.println\( Match System.out.println(
  • "[^":]* Match " and then optional chars other than " and :
  • \s(?:job|jb?c)\s Match either job, jbc, or jc between whitespace chars (or use word boundaries: \b(?:job|jb?c)\b)
  • [^":]*:[^"]*" Match optional chars other than " and :, then match :, then optional chars other than ", then the closing "
  • [^)]*\); Match optional chars other than ), then match ) and ;
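
To plug the pattern into the scan from the question, something along these lines might work. It is a sketch under assumptions: the function name Find-JobCodePrintln and the extension filter are hypothetical, and -List is dropped so that every matching line is reported, not only the first one per file.

```powershell
# Hypothetical helper; the name and the extension filter are assumptions for illustration.
function Find-JobCodePrintln {
    param(
        [Parameter(Mandatory)]
        [string]$Root
    )

    # Pattern from the answer, single-quoted so PowerShell does not expand anything in it.
    $pattern = 'System\.out\.println\("[^":]*\s(?:job|jb?c)\s[^":]*:[^"]*"[^)]*\);'

    # Limit the scan to the file types mentioned in the question, then search each file line by line.
    Get-ChildItem -Path $Root -Recurse -File -Include '*.java', '*.html', '*.xml' |
        Select-String -Pattern $pattern |
        Select-Object Path, LineNumber, Line
}
```

Calling Find-JobCodePrintln -Root 'C:\my\folder\path' returns one object per matching line with the file path, line number, and line text, which can then be piped to Export-Csv for the recurring runs.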


An alternative that uses word boundaries and does not require a colon:

System\.out\.println\("[^":]*\b(?:job|jb?c)\b[^"]*"[^)]*\);
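
A quick way to see how the two patterns differ is to run both against a few lines. This is a minimal sketch; the second and third sample lines are made up for illustration and do not come from the question.

```powershell
# The two patterns from the answer.
$withColon    = 'System\.out\.println\("[^":]*\s(?:job|jb?c)\s[^":]*:[^"]*"[^)]*\);'
$withoutColon = 'System\.out\.println\("[^":]*\b(?:job|jb?c)\b[^"]*"[^)]*\);'

# First line is Example 2 from the question; the other two are hypothetical.
$samples = @(
    'System.out.println("my jc is: " + var.jc);'
    'System.out.println("jbc=" + var.jbc);'
    'System.out.println("my jobcode is: " + var.jobcode);'
)

# Evaluate each sample against both patterns and collect the results.
$results = foreach ($line in $samples) {
    [pscustomobject]@{
        Line         = $line
        WithColon    = $line -match $withColon
        WithoutColon = $line -match $withoutColon
    }
}

$results | Format-Table -AutoSize
```

The first line matches both patterns. The second matches only the word-boundary variant, because the keyword is not surrounded by whitespace and there is no colon. The third matches neither, since jobcode written as one word does not contain job, jc, or jbc as a separate word.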

