Java regex pattern for a multi-line string

Time:01-18

I'm working on a simple Java program that checks whether a string matches a defined regular expression pattern. I have created a regex pattern, but it returns false when I run it. I need to modify the pattern so that it matches the given string. Below is my source code:

        String thread = "From: Demo Name\n" +
                "Sent: Wednesday, January 18, 2023 2:56 PM\n" +
                "To: [email protected] <[email protected]>\n" +
                "Subject: Demo Issue";
        String regEX ="((^[a-zA-Z]+[:]\\s.*\\n*?\\n){2,4}.+\\nSubject[:].+\\n)+?";

        Pattern pattern = Pattern.compile("((^[a-zA-Z]+[:]\\s.*\\n*?\\n){2,4}.+\\nSubject[:].+?\\n)+?",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        Matcher matcher = pattern.matcher(thread);
        System.out.println(matcher.find());

When I run the program it returns false, but I expect it to return true. In the given string, the labels From:, Sent:, To:, and Subject: are constants and won't change. How should I modify the regex pattern so that it matches?

CodePudding user response:

^ matches the start of the entire input unless you enable MULTILINE mode, in which case it matches at the beginning of every line. You do want MULTILINE mode, so that your ^[a-zA-Z]+:\\s.* pattern matches headers but not stray colons in the middle of the actual text. Note that if someone puts 'Foo: bar' on its own line in the body, you're going to match that too; there is not much you can do about that with regexes alone.
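To make the effect of MULTILINE concrete, here is a minimal sketch that counts header-like lines with and without the flag. The class name MultilineAnchorDemo and the helper countHeaderLines are made up for this illustration, and the input is a shortened version of the string from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineAnchorDemo {
    // Counts how often a "Name: value" header is found at a position where ^ matches.
    static int countHeaderLines(String input, int flags) {
        Matcher m = Pattern.compile("^[a-zA-Z]+:\\s.*", flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "From: Demo Name\nSent: Wednesday\nSubject: Demo Issue";
        // Without MULTILINE, ^ only matches at the very start of the input.
        System.out.println(countHeaderLines(input, 0));
        // With MULTILINE, ^ also matches right after each line terminator.
        System.out.println(countHeaderLines(input, Pattern.MULTILINE));
    }
}
```

Without the flag only the From: line is found; with it, every line of the shortened input counts as a header.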

You then attempt to match a thing that is supposed to consume one header line, 2 to 4 times. After that you require a seemingly arbitrarily injected .+, which will mess you up, as that means Subject can no longer be matched. Get rid of that. You also have \\n all over the place, far too often. It feels like you just shoved stuff in there, hoping that if you add enough, maybe it'll work.

That's not how you make regexes. When they don't work make it smaller, not larger - try to match JUST the first line. Then expand from there. Keep going from 'matching' to 'matching' instead of starting at something that doesn't match when you feel like it should and just shoving stuff into the regex futilely.

The final trickery here is that your input string does not end in a newline, and yet you demand in your regex that the Subject: line ends with a newline. It doesn't, so that doesn't work. Using ^ and $ does work, as those match on end-of-input too.
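The end-of-input point can be checked in isolation. The sketch below is only an illustration: the class name EndOfInputDemo and the helper found are hypothetical, and the To: line is left out to keep the input short. It compares a Subject pattern that demands a trailing newline with one that ends in $.

```java
import java.util.regex.Pattern;

class EndOfInputDemo {
    // Same shape as the input in the question: the last line has no trailing newline.
    static final String THREAD =
        "From: Demo Name\n" +
        "Sent: Wednesday, January 18, 2023 2:56 PM\n" +
        "Subject: Demo Issue";

    // True if the regex is found anywhere in THREAD, using MULTILINE only.
    static boolean found(String regex) {
        return Pattern.compile(regex, Pattern.MULTILINE).matcher(THREAD).find();
    }

    public static void main(String[] args) {
        // Demands a newline after the subject line, which the input does not have.
        System.out.println(found("^Subject:.*\\n"));
        // $ matches before a line terminator and also at the end of the input.
        System.out.println(found("^Subject:.*$"));
    }
}
```

The first call reports no match and the second reports a match, which is the difference described above.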

Using that strategy, I fixed your regular expression for you:

String regEX = "(^[a-zA-Z]+:\\s.*\\n){2,4}^Subject:.*$";
// use flags CASE_INSENSITIVE and MULTILINE but not DOTALL.

Don't use DOTALL - that means .* just eats everything (including the newline, which you don't want).
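A small sketch of what DOTALL does to .* may help here. The names DotallDemo and firstMatch are invented for the example; the pattern is deliberately tiny so that only the flag differs between the two calls.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {
    // Returns the text matched by "^From:.*" under the given flags, or null if nothing matches.
    static String firstMatch(String input, int flags) {
        Matcher m = Pattern.compile("^From:.*", flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "From: Demo Name\nSubject: Demo Issue";
        // Without DOTALL, . does not match line terminators, so .* stops at the end of the line.
        System.out.println(firstMatch(input, Pattern.MULTILINE));
        // With DOTALL, . matches line terminators too, so .* runs to the end of the input.
        System.out.println(firstMatch(input, Pattern.MULTILINE | Pattern.DOTALL));
    }
}
```

With DOTALL the "one header line" pattern swallows the Subject line as well, which is why the per-line repetition in the original regex cannot work as intended.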

HOWEVER

This regex seems to be ill advised though. What are you actually trying to accomplish? If the input is 'constant', why not just ditch regexes and search for "\nSubject: " instead? If you're trying to just get rid of all headers, why not search for the double enter that separates headers from the body and eliminate the rest?

int headerSplit = in.indexOf("\n\n");
String bodyOnly = in.substring(headerSplit + 2);
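One thing to watch with the two-line split above: indexOf returns -1 when there is no blank line, and substring(-1 + 2) would then silently drop the first character. A hedged sketch with a guard follows; the class name BodySplitDemo and the choice to return the whole input when no blank line exists are assumptions made for this example.

```java
class BodySplitDemo {
    // Returns everything after the first blank line.
    // Assumption: when there is no blank line, the input is returned unchanged.
    static String bodyOnly(String in) {
        int headerSplit = in.indexOf("\n\n");
        if (headerSplit == -1) {
            return in;
        }
        return in.substring(headerSplit + 2);
    }

    public static void main(String[] args) {
        System.out.println(bodyOnly("Subject: Demo Issue\n\nHello there"));
        System.out.println(bodyOnly("no blank line here"));
    }
}
```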

If you want a combination of these things, then write that. "Put it all in one gigantic regex" is rarely the way to get to easy to maintain code. If this is a full news/mail message, then first find the blank line so you can separate headers from content (after all, Foo: bar is perfectly legal to write in an email message, doesn't mean it has a Foo header!), then if you want to specifically pick up the subject, either write a regex or, you don't really need one:

String getSubjectFromEmail(String in) {
  int headerEnd = in.indexOf("\n\n");
  int subject = in.indexOf("Subject: ");
  // No Subject header at all, or the text only appears in the body after the blank line.
  if (subject == -1 || (headerEnd != -1 && subject > headerEnd)) return null;
  int subjectEnd = in.indexOf('\n', subject);
  return in.substring(subject + "Subject: ".length(), subjectEnd == -1 ? in.length() : subjectEnd);
}

Does it without regular expressions. Regexes aren't 'good' at trying to find that 'end of headers' bit. A hybrid approach, if you prefer that:

import java.util.regex.Pattern;

class Test {
  private static final Pattern SUBJECT_FINDER = Pattern.compile("^Subject: (.*)$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

  String getSubjectFromEmail(String in) {
    int headerEnd = in.indexOf("\n\n");
    var m = SUBJECT_FINDER.matcher(in);
    if (headerEnd != -1) m.region(0, headerEnd);
    if (!m.find()) return null;
    return m.group(1);
  }

  static final String TEST_TEXT = """
    From: Demo Name
    Sent: Wednesday, January 18, 2023 2:56 PM
    To: [email protected] <[email protected]>
    Subject: Demo Issue"""
    .replace("\r", "");

  void test() {
    String subject = getSubjectFromEmail(TEST_TEXT);
    System.out.println("Subject found: " + subject);
  }

  public static void main(String[] args) {
    new Test().test();
  }
}

CodePudding user response:

Your pattern matches a newline at the end, but there is no newline at the end of the example data.

If the constants never change, you can match them literally, using \h to match a horizontal whitespace character and \R to match any Unicode newline sequence:

^From:\h+.+\RSent:\h+.+\RTo:\h+.+\RSubject:\h+.*

In Java, with Pattern.MULTILINE and Pattern.CASE_INSENSITIVE and doubled backslashes:

String regEX = "^From:\\h+.+\\RSent:\\h+.+\\RTo:\\h+.+\\RSubject:\\h+.*";



If you want to match 2-4 lines followed by Subject:

(?:^[a-z]+:\h.*\R){2,4}Subject:.*

In Java, with Pattern.MULTILINE and Pattern.CASE_INSENSITIVE and doubled backslashes:

String regEX = "(?:^[a-z]+:\\h.*\\R){2,4}Subject:.*";
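Both patterns can be tried against a string shaped like the one in the question. This is a sketch only: the class name HeaderPatternDemo, the helper matches, and the placeholder address demo@example.com are assumptions for illustration, not part of the original post.

```java
import java.util.regex.Pattern;

class HeaderPatternDemo {
    static final int FLAGS = Pattern.MULTILINE | Pattern.CASE_INSENSITIVE;

    // All four header names spelled out literally.
    static final Pattern FIXED = Pattern.compile(
        "^From:\\h+.+\\RSent:\\h+.+\\RTo:\\h+.+\\RSubject:\\h+.*", FLAGS);

    // Any 2 to 4 "Name: value" lines, followed by a Subject line.
    static final Pattern FLEXIBLE = Pattern.compile(
        "(?:^[a-z]+:\\h.*\\R){2,4}Subject:.*", FLAGS);

    static boolean matches(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        // Placeholder address; the last line has no trailing newline, as in the question.
        String thread = "From: Demo Name\n"
            + "Sent: Wednesday, January 18, 2023 2:56 PM\n"
            + "To: demo@example.com\n"
            + "Subject: Demo Issue";
        System.out.println(matches(FIXED, thread));
        System.out.println(matches(FLEXIBLE, thread));
        // Only one header line before Subject, so the {2,4} quantifier cannot be satisfied.
        System.out.println(matches(FLEXIBLE, "From: Demo Name\nSubject: Demo Issue"));
    }
}
```

Neither pattern requires a newline after the Subject line, so the missing trailing newline in the input is not a problem.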
