Java 11 : RegEx to Parse Log Message to CSV format

Time:04-02

I need to parse logs and print them in CSV format, as shown below. So far I have been able to extract the timestamp, message, and id from each log line, but I am having a hard time extracting the remaining fields with a regex. Can someone please help me parse fields such as field_1, field_2, field_3, field_4, field_5, field_6, field_9, ... from the input logs? If the input logs contain more fields, such as field_10 or field_11, those need to be included in the CSV headers and values as well. Below is the expected output for the input logs.

Expected CSV output

id,timestamp,message,field_1,field_2,field_3,field_4,field_5,field_6,field_9
1,"Wed Jun 19 09:35:40 PDT 2019","test 4",,,,,"test 4",,
2,"Wed Jun 19 09:35:39 PDT 2019","test 3",,,,"test 3",,,
3,"Wed Jun 19 09:35:38 PDT 2019","test 2",,,"test 2",,,,"test 23"
4,"Wed Jun 19 09:35:37 PDT 2019","test 4",,"test 4",,,,,
5,"Wed Jun 19 09:35:37 PDT 2019","test 5",,,"test 10",,,,
6,"Wed Jun 19 09:35:40 PDT 2019","test 6","test 5",,,,,,
10,"Wed Jun 19 09:35:36 PDT 2019","test 10","test 10",,,,,"test 1",

LogParser

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogParser {
    public static void main(String[] args) {
        String logs = ""
                + "timestamp=\"Wed Jun 19 09:35:36 PDT 2019\" message=\"test 10\" id=10 field_1=\"test 10\" field_6=\"test 1\"\n"
                + "timestamp=\"Wed Jun 19 09:35:37 PDT 2019\" message=\"test 4\" id=4 field_2=\"test 4\"\n"
                + "timestamp=\"Wed Jun 19 09:35:38 PDT 2019\" message=\"test 2\" id=3 field_3=\"test 2\" field_9=\"test 23\"\n"
                + "timestamp=\"Wed Jun 19 09:35:39 PDT 2019\" message=\"test 3\" id=2 field_4=\"test 3\"\n"
                + "timestamp=\"Wed Jun 19 09:35:40 PDT 2019\" message=\"test 4\" id=1 field_5=\"test 4\"\n"
                + "timestamp=\"Wed Jun 19 09:35:37 PDT 2019\" message=\"test 5\" id=5 field_3=\"test 10\"\n"
                + "timestamp=\"Wed Jun 19 09:35:40 PDT 2019\" message=\"test 6\" id=6 field_1=\"test 5\"";
        // Groups: 1 = timestamp, 2 = message, 3 = id, 4 = the remaining field_N="..." pairs
        final Pattern p = Pattern.compile("timestamp=(.+?) message=(.+?) id=(.+?) (.+?)");
        for (String line : logs.split("\n")) {

            Matcher m = p.matcher(line);

            String csvline = null;
            String fields = null;
            if (m.matches()) {
                csvline = (m.group(3)) + "," + m.group(1) + "," + m.group(2);
                fields = m.group(4);
            }
            System.out.println(fields); // having a hard time parsing the headers, values, and formatting

        }
    }
}

CodePudding user response:

I would use (((.+?)="(.+?)")\s*). For each line, you will get one match per field, so you have to iterate through the matches. Each match will have four groups:

  • the third group will have the field name
  • the fourth group will have the field value

Alternatively, you could modify your regex to timestamp=(.+?) message=(.+?) id=(.+?) (.+?="(.+?)")*, but I think the above is cleaner and more flexible.
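
For illustration, here is a minimal sketch of the match-per-field approach producing the CSV from the question. It is not the answerer's code: the class name LogToCsv, the method toCsv, and the two patterns LINE and FIELD are made up for this example. It makes a few assumptions: the pair pattern is applied only to the part of the line after id (id=10 is unquoted, so on the whole line it would get absorbed into the next field name), field names look like field_<number>, ids are unique, values contain no embedded double quotes, and rows are sorted by id as in the expected output.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogToCsv {
    // Fixed prefix of each line; group 4 is the remainder holding the field_N="..." pairs.
    private static final Pattern LINE =
            Pattern.compile("timestamp=(\".*?\") message=(\".*?\") id=(\\d+)\\s*(.*)");
    // One name="value" pair; the value keeps its quotes so it can go straight into the CSV.
    private static final Pattern FIELD = Pattern.compile("(\\w+)=(\".*?\")\\s*");

    public static String toCsv(String logs) {
        // id -> column values for that row, iterated in numeric id order.
        Map<Integer, Map<String, String>> rows = new TreeMap<>();
        // Extra column names ordered by numeric suffix (assumes names like field_<number>).
        Set<String> extra = new TreeSet<>(Comparator.comparingInt(
                (String name) -> Integer.parseInt(name.substring(name.lastIndexOf('_') + 1))));

        for (String line : logs.split("\n")) {
            Matcher m = LINE.matcher(line.trim());
            if (!m.matches()) {
                continue; // skip lines without the fixed prefix
            }
            Map<String, String> row = new HashMap<>();
            row.put("timestamp", m.group(1));
            row.put("message", m.group(2));
            Matcher f = FIELD.matcher(m.group(4));
            while (f.find()) { // one match per field
                row.put(f.group(1), f.group(2));
                extra.add(f.group(1));
            }
            rows.put(Integer.parseInt(m.group(3)), row);
        }

        List<String> columns = new ArrayList<>(List.of("timestamp", "message"));
        columns.addAll(extra);

        StringBuilder csv = new StringBuilder("id," + String.join(",", columns) + "\n");
        for (Map.Entry<Integer, Map<String, String>> e : rows.entrySet()) {
            csv.append(e.getKey());
            for (String col : columns) {
                // Missing fields become empty cells.
                csv.append(',').append(e.getValue().getOrDefault(col, ""));
            }
            csv.append('\n');
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        String logs = "timestamp=\"Wed Jun 19 09:35:38 PDT 2019\" message=\"test 2\" id=3 field_3=\"test 2\" field_9=\"test 23\"\n"
                + "timestamp=\"Wed Jun 19 09:35:40 PDT 2019\" message=\"test 4\" id=1 field_5=\"test 4\"";
        System.out.print(toCsv(logs));
    }
}
```

Because the header set is collected while parsing, a field_10 or field_11 that shows up in any line becomes a column automatically, and the numeric sort keeps field_10 after field_9.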

CodePudding user response:

Tony BenBrahim's first regex looks like the generic regex I would use. One thing that might make it more intuitive to some developers is to change the two outer pairs of parentheses to non-capturing groups:

(?:(?:(.+?)="(.+?)")\s*)

Group one will contain the field name, group two will contain the field value.
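
As a small sketch of how that reads in Java (the class name FieldPairs and the method parse are made up for this example), the pattern below is run with find() over just the field portion of a line, so group 1 is the name and group 2 is the value:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldPairs {
    // Outer groups are non-capturing, so only name and value are numbered.
    private static final Pattern PAIR = Pattern.compile("(?:(?:(.+?)=\"(.+?)\")\\s*)");

    public static Map<String, String> parse(String fields) {
        Map<String, String> result = new LinkedHashMap<>(); // keeps input order
        Matcher m = PAIR.matcher(fields);
        while (m.find()) {
            result.put(m.group(1), m.group(2)); // group 1 = name, group 2 = value
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(PAIR.matcher("").groupCount()); // 2 capturing groups
        System.out.println(parse("field_3=\"test 2\" field_9=\"test 23\""));
    }
}
```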

If you aren't in control of the format of all the logs you will be parsing, whitespace may be allowed. I would account for whitespace in one more place where I suspect it may be allowed with:

(?:(?:(.+?)=\s*"(.+?)")\s*)

An empty value might also be permitted, so I would allow that by changing the second capture group's quantifier:

(?:(?:(.+?)=\s*"(.*?)")\s*)
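
To see what the quantifier change buys, here is a hedged sketch comparing the two versions on an empty value. The class name EmptyValueDemo and the helper firstValue are made up for this example; the helper returns null when the pattern finds nothing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyValueDemo {
    // Value must have at least one character.
    static final Pattern NON_EMPTY = Pattern.compile("(?:(?:(.+?)=\\s*\"(.+?)\")\\s*)");
    // Value may be empty.
    static final Pattern ALLOW_EMPTY = Pattern.compile("(?:(?:(.+?)=\\s*\"(.*?)\")\\s*)");

    // Returns the value of the first pair found, or null if there is no match.
    public static String firstValue(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String input = "field_1= \"\""; // whitespace after '=' and an empty value
        System.out.println(firstValue(NON_EMPTY, input) == null);  // no match at all
        System.out.println("[" + firstValue(ALLOW_EMPTY, input) + "]"); // matched, empty value
    }
}
```

With the + quantifier the empty field is not just reported as empty; it is not matched at all, so it would silently drop out of the CSV.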

The final modification that might be useful is to trim whitespace off the ends of the two captured fields (i.e. remove leading and trailing whitespace from either end of the captured results):

(?:(?:\s*(.+?)\s*=\s*"\s*(.*?)\s*")\s*)
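
A quick sketch of the trimming version in use (the class name TrimmedPairs and the method parse are made up for this example; the input string is invented to exercise the padding):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrimmedPairs {
    // The \s* on either side of each capture group keeps padding out of the captures.
    private static final Pattern PAIR =
            Pattern.compile("(?:(?:\\s*(.+?)\\s*=\\s*\"\\s*(.*?)\\s*\")\\s*)");

    public static Map<String, String> parse(String fields) {
        Map<String, String> result = new LinkedHashMap<>(); // keeps input order
        Matcher m = PAIR.matcher(fields);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // Padding around the name, around '=', and inside the quotes; second value is empty.
        Map<String, String> r = parse(" field_1 = \" padded value \" field_2=\"\"");
        r.forEach((name, value) -> System.out.println("[" + name + "] -> [" + value + "]"));
    }
}
```

Doing the trimming in the pattern saves a call to String.trim() on every captured group, at the cost of a regex that is harder to read.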
