log file to dataframe

I have this type of log file:

[2022-01-01 00:01:08,111][train][info] - {"epoch":99, "data_loss":"111.013", "data_ntokens":"123.672"," data_nsentences":"2", "data_nll_loss":"2.01"} 
[2022-01-01 00:01:08,111][train][info] - {"epoch":100, "data_loss":"111.01", "data_ntokens":"123.672"," data_nsentences":"2", "data_nll_loss":"2.901"} 
[2022-01-01 00:01:08,111][train][info] - {"epoch":102, "data_loss":"222.09", "data_ntokens":"123.600"," data_nsentences":"2", "data_nll_loss":"2.1"}

I would like to extract the information inside the curly braces, but the length of each entry varies, so I cannot parse it with fixed string positions.

The dataframe I am trying to get looks like this:

+-----------------------------------------------------------------------+
| epoch | data_loss | data_ntokens | data_nsentences | data_nll_loss    |
+-----------------------------------------------------------------------+
|  99   |  111.013  |  123.672     |      2          |   2.01           |
.....

CodePudding user response:

You can read the log file and split each line on the '-' character (the date contains two hyphens, so the JSON payload is the piece at index 3). Then you can build a list of dictionaries with a list comprehension and create a pandas dataframe from that list. Finally, as Will Zhao says, you can use tabulate to print the dataframe in a readable way. This is my approach:

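import json

import pandas as pd
from tabulate import tabulate

with open("log_file.log", "r") as f:
    lines = f.readlines()

# The date contributes two '-' and the separator a third, so the JSON payload
# is the piece at index 3; maxsplit=3 keeps any '-' inside the JSON intact.
records = [json.loads(line.split("-", 3)[3].strip()) for line in lines if line.strip()]
df = pd.DataFrame(records).set_index("epoch")
print(tabulate(df, headers="keys", tablefmt="psql"))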

Output:

+---------+-------------+----------------+--------------------+-----------------+
|   epoch |   data_loss |   data_ntokens |    data_nsentences |   data_nll_loss |
|---------+-------------+----------------+--------------------+-----------------|
|      99 |     111.013 |        123.672 |                  2 |           2.01  |
|     100 |     111.01  |        123.672 |                  2 |           2.901 |
|     102 |     222.09  |        123.6   |                  2 |           2.1   |
+---------+-------------+----------------+--------------------+-----------------+
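
Note that json.loads keeps the quoted values as strings ("111.013" rather than 111.013), and one key in the sample data carries a leading space (" data_nsentences"). If numeric columns and clean column names are needed, a minimal sketch along these lines might help. The regular expression and the helper names parse_log_line and to_number are illustrative assumptions, not part of the original answer, and the pattern is based only on the three sample lines shown in the question.

```python
import json
import re

# Matches "[timestamp][name][level] - {json}" and captures the JSON payload.
# Assumption: every line has exactly this shape, as in the question's samples.
LINE_RE = re.compile(r"^\[[^\]]*\]\[[^\]]*\]\[[^\]]*\]\s+-\s+(\{.*\})\s*$")


def to_number(value):
    """Convert a numeric string to int or float; leave anything else unchanged."""
    if not isinstance(value, str):
        return value
    try:
        return int(value)
    except ValueError:
        pass
    try:
        return float(value)
    except ValueError:
        return value


def parse_log_line(line):
    """Return the JSON payload of one log line as a dict, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    payload = json.loads(match.group(1))
    # Strip stray whitespace from keys (e.g. " data_nsentences") and convert values.
    return {key.strip(): to_number(value) for key, value in payload.items()}


sample = (
    '[2022-01-01 00:01:08,111][train][info] - {"epoch":99, "data_loss":"111.013", '
    '"data_ntokens":"123.672"," data_nsentences":"2", "data_nll_loss":"2.01"} '
)
record = parse_log_line(sample)
print(record)
```

The dicts returned for each line can be collected in a list and passed to pd.DataFrame in the same way as in the snippet above; lines that return None (no match) would need to be filtered out first.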

CodePudding user response:

You could use tabulate or prettytable to display the dataframe as a formatted table. You could also define your own string formatting (for example with f-strings) to get a similar result.

Update: added a manual way to print a pretty table.

test = {"epoch":99, "data_loss":"111.013", "data_ntokens":"123.672"," data_nsentences":"2", "data_nll_loss":"2.01"}
# Centre each key and its value in a field two characters wider than the key.
key_string = [i.center(2 + len(i)) for i in test.keys()]
keys_string = "|" + "|".join(key_string) + "|"
value_string = [str(v).center(2 + len(k)) for k, v in test.items()]
values_string = "|" + "|".join(value_string) + "|"
# The border is as wide as the header row, with '+' at both corners.
divide_string = "+" + "-" * (len(keys_string) - 2) + "+"
print(divide_string)
print(keys_string)
print(divide_string)
print(values_string)
print(divide_string)

Output:

+---------------------------------------------------------------------+
| epoch | data_loss | data_ntokens |  data_nsentences | data_nll_loss |
+---------------------------------------------------------------------+
|   99  |  111.013  |   123.672    |        2         |      2.01     |
+---------------------------------------------------------------------+