How to parse log file to csv file in Java

Time:06-12

I want to read my log file and write its contents to a CSV file in Java. How can I parse a log file with the delimiters shown below into a CSV file like the one further down?

.log file:

2022-06-01 11:00:00 wt.nm=aa&wt.ti=t1&
2022-06-02 12:00:00 wt.nm=ab&wt.ti=t2&
2022-06-03 10:00:00 wt.nm=ac&wt.ti=t3&

The date and time are separated by a space. The name follows wt.nm= and the title follows wt.ti=, and each value ends with &.

CSV output:

date,time,name,title 
2022-06-01,11:00:00,aa,t1
2022-06-02,12:00:00,ab,t2
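2022-06-03,10:00:00,ac,t3

My code so far:

import java.io.*;

public class test {
    public static void main(String[] args) {
        try {
            BufferedReader in = new BufferedReader(new FileReader("/Users/ts/Desktop/test/src/0606.log"));
            FileWriter wb = new FileWriter("/Users/ts/Desktop/testcsv.csv");
            String str;
            while ((str = in.readLine()) != null) {
                System.out.println(str);
                // So far this only copies each line unchanged.
                wb.write(str + System.lineSeparator());
            }
            // Close both files so the buffered output actually reaches the disk.
            in.close();
            wb.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}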

CodePudding user response:

The following will do the transformation.

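while ((str = in.readLine()) != null) {
    // Turn the separators into commas, then drop the trailing '&'.
    // The dots are escaped so that they match a literal '.'.
    str = str.replaceAll(" wt\\.nm=|&wt\\.ti=| ", ",").replace("&", "");
    System.out.println(str);
    wb.write(str + System.lineSeparator()); // one CSV row per line
}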
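
For reference, here is a minimal sketch of how that replacement could sit in a complete program. It is one possible arrangement, not the asker's actual code: the class name LogToCsv, the helper names toCsvRow and convert, and the default file names input.log and output.csv are all made up for illustration. It also writes the header row from the question and uses try-with-resources so both files are closed.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class LogToCsv {

    // Converts one log line such as "2022-06-01 11:00:00 wt.nm=aa&wt.ti=t1&"
    // into a CSV row. Assumes exactly the wt.nm/wt.ti layout from the question.
    static String toCsvRow(String logLine) {
        return logLine.replaceAll(" wt\\.nm=|&wt\\.ti=| ", ",").replace("&", "");
    }

    // Reads the log line by line and writes a header plus one CSV row per line.
    static void convert(Path log, Path csv) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(log, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(csv, StandardCharsets.UTF_8)) {
            out.write("date,time,name,title");
            out.newLine();
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                out.write(toCsvRow(line.trim()));
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default paths; pass real ones as arguments.
        Path log = Paths.get(args.length > 0 ? args[0] : "input.log");
        Path csv = Paths.get(args.length > 1 ? args[1] : "output.csv");
        convert(log, csv);
    }
}
```

Keeping the per-line conversion in its own method makes it easy to check against the sample lines without touching any files.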

CodePudding user response:

The format in your data (ignoring the leading date and time) looks like:

  • wt.nm is always present
  • wt.ti is always present
  • wt.nm always appears before wt.ti
  • there are never additional name value pairs (beyond wt.nm and wt.ti)

However, the data has a more general pattern to it:

  • one (or more) name value pairs
  • for each name value pair, the name is separated from the value by =
  • each name value pair is separated from other pairs by &

Also, it's easy to imagine encountering additional name value pairs (who says it will always be those same two forever? why not five name values in a single line?), or perhaps the ordering changes (could wt.ti show up before wt.nm?).

The code below takes a general approach, working with the general pattern of your input data. I included some sample data to show how it works.

  • the first input line – "2022-06-01 11:00:00 wt.nm=aa&wt.ti=t1&" – one of your original inputs
  • the second line – "2022-06-01 11:00:00 xxxxxxxxxxx=bb&" – uses a different "name" altogether, and only has a single name value pair (not two): xxxxxxxxxxx=bb
  • the third line – "2022-06-01 11:00:00 xxxxxxxxxxx=cc&yyyyyyyyyyyy=t3&zzzzzzzz=99&" – includes a third name value pair: zzzzzzzz=99. Who says there will always be just two name value pairs?

// requires: import java.util.StringTokenizer;
String[] lines = {
        "2022-06-01 11:00:00 wt.nm=aa&wt.ti=t1&",
        "2022-06-01 11:00:00 xxxxxxxxxxx=bb&",
        "2022-06-01 11:00:00 xxxxxxxxxxx=cc&yyyyyyyyyyyy=t3&zzzzzzzz=99&"
};
for (String line : lines) {
    System.out.println("original: " + line);
    // '&' separates the pairs; replacing it with a space lets the tokenizer split on whitespace
    String edit1 = line.replaceAll("&", " ");
    System.out.println("          " + edit1);

    StringTokenizer tokenizer = new StringTokenizer(edit1);
    StringBuilder finalLine = new StringBuilder();
    while (tokenizer.hasMoreTokens()) {
        String token = tokenizer.nextToken();
        System.out.println("   token: " + token);
        if (token.contains("=")) {
            // name=value pair: keep only the value
            int positionOfEqualsSign = token.indexOf("=");
            String value = token.substring(positionOfEqualsSign + 1);
            finalLine.append(value);
        } else {
            // date or time: keep as-is
            finalLine.append(token);
        }
        if (tokenizer.hasMoreTokens()) {
            finalLine.append(",");
        }
    }
    System.out.println("   final: " + finalLine);
    System.out.println();
}

Here's the output from that code, which includes a lot of extra output in order to be easier to follow:

original: 2022-06-01 11:00:00 wt.nm=aa&wt.ti=t1&
          2022-06-01 11:00:00 wt.nm=aa wt.ti=t1 
   token: 2022-06-01
   token: 11:00:00
   token: wt.nm=aa
   token: wt.ti=t1
   final: 2022-06-01,11:00:00,aa,t1

original: 2022-06-01 11:00:00 xxxxxxxxxxx=bb&
          2022-06-01 11:00:00 xxxxxxxxxxx=bb 
   token: 2022-06-01
   token: 11:00:00
   token: xxxxxxxxxxx=bb
   final: 2022-06-01,11:00:00,bb

original: 2022-06-01 11:00:00 xxxxxxxxxxx=cc&yyyyyyyyyyyy=t3&zzzzzzzz=99&
          2022-06-01 11:00:00 xxxxxxxxxxx=cc yyyyyyyyyyyy=t3 zzzzzzzz=99 
   token: 2022-06-01
   token: 11:00:00
   token: xxxxxxxxxxx=cc
   token: yyyyyyyyyyyy=t3
   token: zzzzzzzz=99
   final: 2022-06-01,11:00:00,cc,t3,99
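
One thing to keep in mind with the approach above: values are emitted in the order they appear, so if wt.ti ever shows up before wt.nm, the CSV columns would swap. If the columns must stay fixed as date,time,name,title, a possible variation is to collect the pairs into a map and look them up by name. The sketch below is only an illustration under that assumption; the class name NameValueParser and the methods parsePairs and toCsvRow are hypothetical, and missing fields are written as empty cells.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NameValueParser {

    // Parses "wt.nm=aa&wt.ti=t1&" into an insertion-ordered map of name to value.
    static Map<String, String> parsePairs(String pairs) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String pair : pairs.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // skip empty or malformed segments
            }
            result.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return result;
    }

    // Builds a "date,time,name,title" row regardless of the order of the pairs.
    // Fields that are absent from the line become empty CSV cells.
    static String toCsvRow(String logLine) {
        String[] parts = logLine.trim().split(" ", 3); // date, time, remaining pairs
        Map<String, String> pairs = parsePairs(parts.length > 2 ? parts[2] : "");
        return String.join(",",
                parts[0],
                parts.length > 1 ? parts[1] : "",
                pairs.getOrDefault("wt.nm", ""),
                pairs.getOrDefault("wt.ti", ""));
    }

    public static void main(String[] args) {
        System.out.println(toCsvRow("2022-06-01 11:00:00 wt.nm=aa&wt.ti=t1&"));
        // Same fields in the opposite order still land in the same columns.
        System.out.println(toCsvRow("2022-06-02 12:00:00 wt.ti=t2&wt.nm=ab&"));
    }
}
```

This trades the fully generic output for a fixed set of columns, which is usually what a CSV with a header row needs.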